Modified discrete cosine transform

Last updated

The modified discrete cosine transform (MDCT) is a transform based on the type-IV discrete cosine transform (DCT-IV), with the additional property of being lapped: it is designed to be performed on consecutive blocks of a larger dataset, where subsequent blocks are overlapped so that the last half of one block coincides with the first half of the next block. This overlapping, in addition to the energy-compaction qualities of the DCT, makes the MDCT especially attractive for signal compression applications, since it helps to avoid artifacts stemming from the block boundaries. As a result of these advantages, the MDCT is the most widely used lossy compression technique in audio data compression. It is employed in most modern audio coding standards, including MP3, Dolby Digital (AC-3), Vorbis (Ogg), Windows Media Audio (WMA), ATRAC, Cook, Advanced Audio Coding (AAC), [1] High-Definition Coding (HDC), [2] LDAC, Dolby AC-4, [3] and MPEG-H 3D Audio, [4] as well as speech coding standards such as AAC-LD (LD-MDCT), [5] G.722.1, [6] G.729.1, [7] CELT, [8] and Opus. [9] [10]

Contents

The discrete cosine transform (DCT) was first proposed by Nasir Ahmed in 1972, [11] and demonstrated by Ahmed with T. Natarajan and K. R. Rao in 1974. [12] The MDCT was later proposed by John P. Princen, A.W. Johnson and Alan B. Bradley at the University of Surrey in 1987, [13] following earlier work by Princen and Bradley (1986) [14] to develop the MDCT's underlying principle of time-domain aliasing cancellation (TDAC), described below. (There also exists an analogous transform, the MDST, based on the discrete sine transform, as well as other, rarely used, forms of the MDCT based on different types of DCT or DCT/DST combinations.)

In MP3, the MDCT is not applied to the audio signal directly, but rather to the output of a 32-band polyphase quadrature filter (PQF) bank. The output of this MDCT is postprocessed by an alias reduction formula to reduce the typical aliasing of the PQF filter bank. Such a combination of a filter bank with an MDCT is called a hybrid filter bank or a subband MDCT. AAC, on the other hand, normally uses a pure MDCT; only the (rarely used) MPEG-4 AAC-SSR variant (by Sony) uses a four-band PQF bank followed by an MDCT. Similar to MP3, ATRAC uses stacked quadrature mirror filters (QMF) followed by an MDCT.

Definition

As a lapped transform, the MDCT is a bit unusual compared to other Fourier-related transforms in that it has half as many outputs as inputs (instead of the same number). In particular, it is a linear function (where R denotes the set of real numbers). The 2N real numbers x0, ..., x2N-1 are transformed into the N real numbers X0, ..., XN-1 according to the formula:

(The normalization coefficient in front of this transform, here unity, is an arbitrary convention and differs between treatments. Only the product of the normalizations of the MDCT and the IMDCT, below, is constrained.)

Inverse transform

The inverse MDCT is known as the IMDCT. Because there are different numbers of inputs and outputs, at first glance it might seem that the MDCT should not be invertible. However, perfect invertibility is achieved by adding the overlapped IMDCTs of subsequent overlapping blocks, causing the errors to cancel and the original data to be retrieved; this technique is known as time-domain aliasing cancellation (TDAC).

The IMDCT transforms N real numbers X0, ..., XN-1 into 2N real numbers y0, ..., y2N-1 according to the formula:

(Like for the DCT-IV, an orthogonal transform, the inverse has the same form as the forward transform.)

In the case of a windowed MDCT with the usual window normalization (see below), the normalization coefficient in front of the IMDCT should be multiplied by 2 (i.e., becoming 2/N).

Computation

Although the direct application of the MDCT formula would require O(N2) operations, it is possible to compute the same thing with only O(N log N) complexity by recursively factorizing the computation, as in the fast Fourier transform (FFT). One can also compute MDCTs via other transforms, typically a DFT (FFT) or a DCT, combined with O(N) pre- and post-processing steps. Also, as described below, any algorithm for the DCT-IV immediately provides a method to compute the MDCT and IMDCT of even size.

Window functions

MDCT window functions:
blue: Cosine, red: Sine-Cosine, green: modified Kaiser-Bessel MDCT WF.png
MDCT window functions:
blue: Cosine, red: Sine-Cosine, green: modified Kaiser-Bessel

In typical signal-compression applications, the transform properties are further improved by using a window function wn (n = 0, ..., 2N−1) that is multiplied with xn in the MDCT and with yn in the IMDCT formulas, above, in order to avoid discontinuities at the n = 0 and 2N boundaries by making the function go smoothly to zero at those points. (That is, the window function is applied to the data before the MDCT or after the IMDCT.) In principle, x and y could have different window functions, and the window function could also change from one block to the next (especially for the case where data blocks of different sizes are combined), but for simplicity we consider the common case of identical window functions for equal-sized blocks.

The transform remains invertible (that is, TDAC works), for a symmetric window wn = w2N−1−n, as long as w satisfies the Princen-Bradley condition:

.

Various window functions are used. A window that produces a form known as a modulated lapped transform (MLT) [15] [16] is given by

and is used for MP3 and MPEG-2 AAC, and

for Vorbis. AC-3 uses a Kaiser-Bessel derived (KBD) window, and MPEG-4 AAC can also use a KBD window.

Note that windows applied to the MDCT are different from windows used for some other types of signal analysis, since they must fulfill the Princen-Bradley condition. One of the reasons for this difference is that MDCT windows are applied twice, for both the MDCT (analysis) and the IMDCT (synthesis).

Relationship to DCT-IV and Origin of TDAC

As can be seen by inspection of the definitions, for evenN the MDCT is essentially equivalent to a DCT-IV, where the input is shifted by N/2 and two N-blocks of data are transformed at once. By examining this equivalence more carefully, important properties like TDAC can be easily derived.

In order to define the precise relationship to the DCT-IV, one must realize that the DCT-IV corresponds to alternating even/odd boundary conditions: even at its left boundary (around n = 1/2), odd at its right boundary (around n = N  1/2), and so on (instead of periodic boundaries as for a DFT). This follows from the identities and . Thus, if its inputs are an array x of length N, we can imagine extending this array to (x, xR, x, xR, ...) and so on, where xR denotes x in reverse order.

Consider an MDCT with 2N inputs and N outputs, where we divide the inputs into four blocks (a, b, c, d) each of size N/2. If we shift these to the right by N/2 (from the +N/2 term in the MDCT definition), then (b, c, d) extend past the end of the N DCT-IV inputs, so we must "fold" them back according to the boundary conditions described above.

Thus, the MDCT of 2N inputs (a, b, c, d) is exactly equivalent to a DCT-IV of the N inputs: (cRd, abR), where R denotes reversal as above.

(In this way, any algorithm to compute the DCT-IV can be trivially applied to the MDCT.)

Similarly, the IMDCT formula above is precisely 1/2 of the DCT-IV (which is its own inverse), where the output is extended (via the boundary conditions) to a length 2N and shifted back to the left by N/2. The inverse DCT-IV would simply give back the inputs (cRd, abR) from above. When this is extended via the boundary conditions and shifted, one obtains:

IMDCT (MDCT (a, b, c, d)) = (abR, baR, c+dR, d+cR) / 2.

Half of the IMDCT outputs are thus redundant, as baR = (abR)R, and likewise for the last two terms. If we group the input into bigger blocks A,B of size N, where A = (a, b) and B = (c, d), we can write this result in a simpler way:

IMDCT (MDCT (A, B)) = (AAR, B+BR) / 2

One can now understand how TDAC works. Suppose that one computes the MDCT of the subsequent, 50% overlapped, 2N block (B, C). The IMDCT will then yield, analogous to the above: (BBR, C+CR) / 2. When this is added with the previous IMDCT result in the overlapping half, the reversed terms cancel and one obtains simply B, recovering the original data.

Origin of TDAC

The origin of the term "time-domain aliasing cancellation" is now clear. The use of input data that extend beyond the boundaries of the logical DCT-IV causes the data to be aliased in the same way that frequencies beyond the Nyquist frequency are aliased to lower frequencies, except that this aliasing occurs in the time domain instead of the frequency domain: we cannot distinguish the contributions of a and of bR to the MDCT of (a, b, c, d), or equivalently, to the result of

IMDCT (MDCT (a, b, c, d)) = (abR, baR, c+dR, d+cR) / 2.

The combinations cdR and so on, have precisely the right signs for the combinations to cancel when they are added.

For oddN (which are rarely used in practice), N/2 is not an integer so the MDCT is not simply a shift permutation of a DCT-IV. In this case, the additional shift by half a sample means that the MDCT/IMDCT becomes equivalent to the DCT-III/II, and the analysis is analogous to the above.

Smoothness and discontinuities

We have seen above that the MDCT of 2N inputs (a, b, c, d) is equivalent to a DCT-IV of the N inputs (cRd, abR). The DCT-IV is designed for the case where the function at the right boundary is odd, and therefore the values near the right boundary are close to 0. If the input signal is smooth, this is the case: the rightmost components of a and bR are consecutive in the input sequence (a, b, c, d), and therefore their difference is small. Let us look at the middle of the interval: if we rewrite the above expression as (cRd, abR) = (d, a)(b,c)R, the second term, (b,c)R, gives a smooth transition in the middle. However, in the first term, (d, a), there is a potential discontinuity where the right end of d meets the left end of a. This is the reason for using a window function that reduces the components near the boundaries of the input sequence (a, b, c, d) towards 0.

TDAC for the windowed MDCT

Above, the TDAC property was proved for the ordinary MDCT, showing that adding IMDCTs of subsequent blocks in their overlapping half recovers the original data. The derivation of this inverse property for the windowed MDCT is only slightly more complicated.

Consider to overlapping consecutive sets of 2N inputs (A,B) and (B,C), for blocks A,B,C of size N. Recall from above that when and are MDCTed, IMDCTed, and added in their overlapping half, we obtain , the original data.

Now we suppose that we multiply both the MDCT inputs and the IMDCT outputs by a window function of length 2N. As above, we assume a symmetric window function, which is therefore of the form where W is a length-N vector and R denotes reversal as before. Then the Princen-Bradley condition can be written as , with the squares and additions performed elementwise.

Therefore, instead of MDCTing , we now MDCT (with all multiplications performed elementwise). When this is IMDCTed and multiplied again (elementwise) by the window function, the last-N half becomes:

.

(Note that we no longer have the multiplication by 1/2, because the IMDCT normalization differs by a factor of 2 in the windowed case.)

Similarly, the windowed MDCT and IMDCT of yields, in its first-N half:

.

When we add these two halves together, we obtain:

recovering the original data.

See also

Related Research Articles

<span class="mw-page-title-main">Trigonometric functions</span> Functions of an angle

In mathematics, the trigonometric functions are real functions which relate an angle of a right-angled triangle to ratios of two side lengths. They are widely used in all sciences that are related to geometry, such as navigation, solid mechanics, celestial mechanics, geodesy, and many others. They are among the simplest periodic functions, and as such are also widely used for studying periodic phenomena through Fourier analysis.

<span class="mw-page-title-main">Hyperbolic functions</span> Collective name of 6 mathematical functions

In mathematics, hyperbolic functions are analogues of the ordinary trigonometric functions, but defined using the hyperbola rather than the circle. Just as the points (cos t, sin t) form a circle with a unit radius, the points (cosh t, sinh t) form the right half of the unit hyperbola. Also, similarly to how the derivatives of sin(t) and cos(t) are cos(t) and –sin(t) respectively, the derivatives of sinh(t) and cosh(t) are cosh(t) and +sinh(t) respectively.

A discrete cosine transform (DCT) expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. The DCT, first proposed by Nasir Ahmed in 1972, is a widely used transformation technique in signal processing and data compression. It is used in most digital media, including digital images, digital video, digital audio, digital television, digital Analógico radio, and speech coding. DCTs are also important to numerous other applications in science and engineering, such as digital signal processing, telecommunication devices, reducing network bandwidth usage, and spectral methods for the numerical solution of partial differential equations.

In mathematics, the discrete sine transform (DST) is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using a purely real matrix. It is equivalent to the imaginary parts of a DFT of roughly twice the length, operating on real data with odd symmetry (since the Fourier transform of a real and odd function is imaginary and odd), where in some variants the input and/or output data are shifted by half a sample.

<span class="mw-page-title-main">Kaiser window</span> Used in finite impulse response filter design and spectral analysis

The Kaiser window, also known as the Kaiser–Bessel window, was developed by James Kaiser at Bell Laboratories. It is a one-parameter family of window functions used in finite impulse response filter design and spectral analysis. The Kaiser window approximates the DPSS window which maximizes the energy concentration in the main lobe but which is difficult to compute.

<span class="mw-page-title-main">Window function</span> Function used in signal processing

In signal processing and statistics, a window function is a mathematical function that is zero-valued outside of some chosen interval. Typically, windows functions are symmetric around the middle of the interval, approach a maximum in the middle, and taper away from the middle. Mathematically, when another function or waveform/data-sequence is "multiplied" by a window function, the product is also zero-valued outside the interval: all that is left is the part where they overlap, the "view through the window". Equivalently, and in actual practice, the segment of data within the window is first isolated, and then only that data is multiplied by the window function values. Thus, tapering, not segmentation, is the main purpose of window functions.

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

<span class="mw-page-title-main">Inverse trigonometric functions</span> Inverse functions of sin, cos, tan, etc.

In mathematics, the inverse trigonometric functions are the inverse functions of the trigonometric functions. Specifically, they are the inverses of the sine, cosine, tangent, cotangent, secant, and cosecant functions, and are used to obtain an angle from any of the angle's trigonometric ratios. Inverse trigonometric functions are widely used in engineering, navigation, physics, and geometry.

<span class="mw-page-title-main">Butterworth filter</span> Type of signal processing filter

The Butterworth filter is a type of signal processing filter designed to have a frequency response that is as flat as possible in the passband. It is also referred to as a maximally flat magnitude filter. It was first described in 1930 by the British engineer and physicist Stephen Butterworth in his paper entitled "On the Theory of Filter Amplifiers".

The Pythagorean trigonometric identity, also called simply the Pythagorean identity, is an identity expressing the Pythagorean theorem in terms of trigonometric functions. Along with the sum-of-angles formulae, it is one of the basic relations between the sine and cosine functions.

<span class="mw-page-title-main">Lemniscate elliptic functions</span> Mathematical functions

In mathematics, the lemniscate elliptic functions are elliptic functions related to the arc length of the lemniscate of Bernoulli. They were first studied by Giulio Fagnano in 1718 and later by Leonhard Euler and Carl Friedrich Gauss, among others.

<span class="mw-page-title-main">Sine and cosine</span> Fundamental trigonometric functions

In mathematics, sine and cosine are trigonometric functions of an angle. The sine and cosine of an acute angle are defined in the context of a right triangle: for the specified angle, its sine is the ratio of the length of the side that is opposite that angle to the length of the longest side of the triangle, and the cosine is the ratio of the length of the adjacent leg to that of the hypotenuse. For an angle , the sine and cosine functions are denoted simply as and .

Clenshaw–Curtis quadrature and Fejér quadrature are methods for numerical integration, or "quadrature", that are based on an expansion of the integrand in terms of Chebyshev polynomials. Equivalently, they employ a change of variables and use a discrete cosine transform (DCT) approximation for the cosine series. Besides having fast-converging accuracy comparable to Gaussian quadrature rules, Clenshaw–Curtis quadrature naturally leads to nested quadrature rules, which is important for both adaptive quadrature and multidimensional quadrature (cubature).

In the 1760s, Johann Heinrich Lambert was the first to prove that the number π is irrational, meaning it cannot be expressed as a fraction , where and are both integers. In the 19th century, Charles Hermite found a proof that requires no prerequisite knowledge beyond basic calculus. Three simplifications of Hermite's proof are due to Mary Cartwright, Ivan Niven, and Nicolas Bourbaki. Another proof, which is a simplification of Lambert's proof, is due to Miklós Laczkovich. Many of these are proofs by contradiction.

In digital image and video processing, a color layout descriptor (CLD) is designed to capture the spatial distribution of color in an image. The feature extraction process consists of two parts: grid based representative color selection and discrete cosine transform with quantization.

In applied mathematics, a discrete Chebyshev transform (DCT) is an analog of the discrete Fourier transform for a function of a real interval, converting in either direction between function values at a set of Chebyshev nodes and coefficients of a function in Chebyshev polynomial basis. Like the Chebyshev polynomials, it is named after Pafnuty Chebyshev.

In mathematical analysis and applications, multidimensional transforms are used to analyze the frequency content of signals in a domain of two or more dimensions.

<span class="mw-page-title-main">Audio coding format</span> Digitally coded format for audio signals

An audio coding format is a content representation format for storage or transmission of digital audio. Examples of audio coding formats include MP3, AAC, Vorbis, FLAC, and Opus. A specific software or hardware implementation capable of audio compression and decompression to/from a specific audio coding format is called an audio codec; an example of an audio codec is LAME, which is one of several different codecs which implements encoding and decoding audio in the MP3 audio coding format in software.

Multiresolution Fourier Transform is an integral fourier transform that represents a specific wavelet-like transform with a fully scalable modulated window, but not all possible translations.

References

  1. Luo, Fa-Long (2008). Mobile Multimedia Broadcasting Standards: Technology and Practice. Springer Science & Business Media. p. 590. ISBN   9780387782638.
  2. Jones, Graham A.; Layer, David H.; Osenkowsky, Thomas G. (2013). National Association of Broadcasters Engineering Handbook: NAB Engineering Handbook. Taylor & Francis. pp. 558–9. ISBN   978-1-136-03410-7.
  3. "Dolby AC-4: Audio Delivery for Next-Generation Entertainment Services" (PDF). Dolby Laboratories . June 2015. Retrieved 11 November 2019.
  4. Bleidt, R. L.; Sen, D.; Niedermeier, A.; Czelhan, B.; Füg, S.; et al. (2017). "Development of the MPEG-H TV Audio System for ATSC 3.0" (PDF). IEEE Transactions on Broadcasting. 63 (1): 202–236. doi:10.1109/TBC.2017.2661258. S2CID   30821673.
  5. Schnell, Markus; Schmidt, Markus; Jander, Manuel; Albert, Tobias; Geiger, Ralf; Ruoppila, Vesa; Ekstrand, Per; Bernhard, Grill (October 2008). MPEG-4 Enhanced Low Delay AAC - A New Standard for High Quality Communication (PDF). 125th AES Convention. Fraunhofer IIS . Audio Engineering Society . Retrieved 20 October 2019.
  6. Lutzky, Manfred; Schuller, Gerald; Gayer, Marc; Krämer, Ulrich; Wabnik, Stefan (May 2004). A guideline to audio codec delay (PDF). 116th AES Convention. Fraunhofer IIS . Audio Engineering Society . Retrieved 24 October 2019.
  7. Nagireddi, Sivannarayana (2008). VoIP Voice and Fax Signal Processing. John Wiley & Sons. p. 69. ISBN   9780470377864.
  8. Presentation of the CELT codec Archived 2011-08-07 at the Wayback Machine by Timothy B. Terriberry (65 minutes of video, see also presentation slides Archived 2023-11-16 at the Wayback Machine in PDF)
  9. "Opus Codec". Opus (Home page). Xiph.org Foundation. Retrieved July 31, 2012.
  10. Bright, Peter (2012-09-12). "Newly standardized Opus audio codec fills every role from online chat to music". Ars Technica . Retrieved 2014-05-28.
  11. Ahmed, Nasir (January 1991). "How I Came Up With the Discrete Cosine Transform" (PDF). Digital Signal Processing . 1 (1): 4–5. doi:10.1016/1051-2004(91)90086-Z.
  12. Ahmed, Nasir; Natarajan, T.; Rao, K. R. (January 1974), "Discrete Cosine Transform", IEEE Transactions on Computers, C-23 (1): 90–93, doi:10.1109/T-C.1974.223784, S2CID   149806273
  13. Princen, John P.; Johnson, A.W.; Bradley, Alan B. (1987). "Subband/Transform coding using filter bank designs based on time domain aliasing cancellation". ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 12. pp. 2161–2164. doi:10.1109/ICASSP.1987.1169405. S2CID   58446992.
  14. John P. Princen, Alan B. Bradley: Analysis/synthesis filter bank design based on time domain aliasing cancellation, IEEE Trans. Acoust. Speech Signal Processing, ASSP-34 (5), 1153–1161, 1986. Described a precursor to the MDCT using a combination of discrete cosine and sine transforms.
  15. H. S. Malvar, "Lapped Transforms for Efficient Transform/Subband Coding", IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 38, no. 6, pp. 969–978 (Equation 22), June 1990.
  16. H. S. Malvar, "Modulated QMF Filter Banks with Perfect Reconstruction", Electronics Letters, vol. 26, no. 13, pp. 906–907 (Equation 13), June 1990.

Bibliography