# Quantization (signal processing)

Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes. Quantization is involved to some degree in nearly all digital signal processing, as the process of representing a signal in digital form ordinarily involves rounding. Quantization also forms the core of essentially all lossy compression algorithms.

The difference between an input value and its quantized value (such as round-off error) is referred to as quantization error. A device or algorithmic function that performs quantization is called a quantizer. An analog-to-digital converter is an example of a quantizer.

## Example

For example, rounding a real number ${\displaystyle x}$ to the nearest integer value forms a very basic type of quantizer – a uniform one. A typical (mid-tread) uniform quantizer with a quantization step size equal to some value ${\displaystyle \Delta }$ can be expressed as

${\displaystyle Q(x)=\Delta \cdot \left\lfloor {\frac {x}{\Delta }}+{\frac {1}{2}}\right\rfloor }$,

where the notation ${\displaystyle \lfloor \ \rfloor }$ denotes the floor function.

The essential property of a quantizer is that its set of possible output values is countable and has fewer members than its set of possible input values. The members of the set of output values may have integer, rational, or real values. For simple rounding to the nearest integer, the step size ${\displaystyle \Delta }$ is equal to 1. With ${\displaystyle \Delta =1}$ or with ${\displaystyle \Delta }$ equal to any other integer value, this quantizer has real-valued inputs and integer-valued outputs.
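The mid-tread formula above translates almost directly into code. The following is a minimal Python sketch; the function name `quantize_mid_tread` and the parameter name `step` (standing in for ${\displaystyle \Delta }$) are illustrative choices, not taken from any library.

```python
import math

def quantize_mid_tread(x: float, step: float) -> float:
    """Mid-tread uniform quantizer: step * floor(x / step + 1/2)."""
    return step * math.floor(x / step + 0.5)

# With step = 1 this is ordinary rounding, with ties going upward.
print(quantize_mid_tread(2.3, 1.0))   # 2.0
print(quantize_mid_tread(2.5, 1.0))   # 3.0
print(quantize_mid_tread(-2.5, 1.0))  # -2.0
# A non-integer step gives outputs that are multiples of that step.
print(quantize_mid_tread(0.26, 0.5))  # 0.5
```

Note that because of the floor function, a tie such as −2.5 is rounded toward positive infinity rather than away from zero.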

When the quantization step size (Δ) is small relative to the variation in the signal being quantized, it is relatively simple to show that the mean squared error produced by such a rounding operation will be approximately ${\displaystyle \Delta ^{2}/12}$. [1] [2] [3] [4] [5] [6] Mean squared error is also called the quantization noise power. Adding one bit to the quantizer halves the value of Δ, which reduces the noise power by the factor ¼. In terms of decibels, the noise power change is ${\displaystyle \scriptstyle 10\cdot \log _{10}(1/4)\ \approx \ -6\ \mathrm {dB} .}$
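The ${\displaystyle \Delta ^{2}/12}$ approximation can be checked numerically. The sketch below assumes a unit-variance Gaussian test signal and a step size of 0.05, both chosen only for illustration; the helper names are hypothetical.

```python
import math
import random

def mid_tread(x, step):
    """Mid-tread uniform quantizer, as in the formula above."""
    return step * math.floor(x / step + 0.5)

def empirical_mse(step, n=100_000, seed=0):
    """Average squared rounding error over n Gaussian samples (sigma = 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        err = x - mid_tread(x, step)
        total += err * err
    return total / n

step = 0.05                    # small relative to the signal's spread
mse = empirical_mse(step)
predicted = step ** 2 / 12     # the approximation from the text
print(mse, predicted)          # the two values should agree closely

# Halving the step size (one extra bit) scales the noise power by 1/4.
db_per_bit = 10 * math.log10(1 / 4)
print(round(db_per_bit, 2))    # -6.02
```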

Because the set of possible output values of a quantizer is countable, any quantizer can be decomposed into two distinct stages, which can be referred to as the classification stage (or forward quantization stage) and the reconstruction stage (or inverse quantization stage). The classification stage maps the input value to an integer quantization index ${\displaystyle k}$, and the reconstruction stage maps the index ${\displaystyle k}$ to the reconstruction value ${\displaystyle y_{k}}$ that is the output approximation of the input value. For the example uniform quantizer described above, the forward quantization stage can be expressed as

${\displaystyle k=\left\lfloor {\frac {x}{\Delta }}+{\frac {1}{2}}\right\rfloor }$,

and the reconstruction stage for this example quantizer is simply

${\displaystyle y_{k}=k\cdot \Delta }$.

This decomposition is useful for the design and analysis of quantization behavior, and it illustrates how the quantized data can be communicated over a communication channel: a source encoder can perform the forward quantization stage and send the index information through the channel, and a decoder can perform the reconstruction stage to produce the output approximation of the original input data. In general, the forward quantization stage may use any function that maps the input data to the integer space of the quantization index data, and the inverse quantization stage can conceptually (or literally) be a table look-up operation to map each quantization index to a corresponding reconstruction value. This two-stage decomposition applies equally well to vector and scalar quantizers.
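As a rough illustration of the two-stage view, the sketch below separates the example uniform quantizer into an encoder-side function and a decoder-side function. The names `classify` and `reconstruct`, the sample values, and the step size are assumptions made for the example.

```python
import math

def classify(x: float, step: float) -> int:
    """Forward quantization: map an input value to an integer index k."""
    return math.floor(x / step + 0.5)

def reconstruct(k: int, step: float) -> float:
    """Inverse quantization: map an index k back to the value y_k = k * step."""
    return k * step

step = 0.25
samples = [0.12, -0.49, 1.26]

# An encoder would transmit only the integer indices...
indices = [classify(x, step) for x in samples]
# ...and a decoder that knows the step size rebuilds the approximations.
decoded = [reconstruct(k, step) for k in indices]

print(indices)  # [0, -2, 5]
print(decoded)  # [0.0, -0.5, 1.25]
```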

## Mathematical properties

Because quantization is a many-to-few mapping, it is an inherently non-linear and irreversible process (i.e., because the same output value is shared by multiple input values, it is impossible, in general, to recover the exact input value when given only the output value).

The set of possible input values may be infinitely large, and may possibly be continuous and therefore uncountable (such as the set of all real numbers, or all real numbers within some limited range). The set of possible output values may be finite or countably infinite. [6] The input and output sets involved in quantization can be defined in a rather general way. For example, vector quantization is the application of quantization to multi-dimensional (vector-valued) input data. [7]

## Types

### Analog-to-digital converter

An analog-to-digital converter (ADC) can be modeled as two processes: sampling and quantization. Sampling converts a time-varying voltage signal into a discrete-time signal, a sequence of real numbers. Quantization replaces each real number with an approximation from a finite set of discrete values. Most commonly, these discrete values are represented as fixed-point words. Though any number of quantization levels is possible, common word-lengths are 8-bit (256 levels), 16-bit (65,536 levels) and 24-bit (16.8 million levels). Quantizing a sequence of numbers produces a sequence of quantization errors which is sometimes modeled as an additive random signal called quantization noise because of its stochastic behavior. The more levels a quantizer uses, the lower is its quantization noise power.

### Rate–distortion optimization

Rate–distortion optimized quantization is encountered in source coding for lossy data compression algorithms, where the purpose is to manage distortion within the limits of the bit rate supported by a communication channel or storage medium. The analysis of quantization in this context involves studying the amount of data (typically measured in digits or bits or bit rate) that is used to represent the output of the quantizer, and studying the loss of precision that is introduced by the quantization process (which is referred to as the distortion).

### Mid-riser and mid-tread uniform quantizers

Most uniform quantizers for signed input data can be classified as being of one of two types: mid-riser and mid-tread. The terminology is based on what happens in the region around the value 0, and uses the analogy of viewing the input-output function of the quantizer as a stairway. Mid-tread quantizers have a zero-valued reconstruction level (corresponding to a tread of a stairway), while mid-riser quantizers have a zero-valued classification threshold (corresponding to a riser of a stairway). [9]

Mid-tread quantization involves rounding. The formulas for mid-tread uniform quantization are provided in the previous section.

Mid-riser quantization involves truncation. The input-output formula for a mid-riser uniform quantizer is given by:

${\displaystyle Q(x)=\Delta \cdot \left(\left\lfloor {\frac {x}{\Delta }}\right\rfloor +{\frac {1}{2}}\right)}$,

where the classification rule is given by

${\displaystyle k=\left\lfloor {\frac {x}{\Delta }}\right\rfloor }$

and the reconstruction rule is

${\displaystyle y_{k}=\Delta \cdot \left(k+{\tfrac {1}{2}}\right)}$.

Note that mid-riser uniform quantizers do not have a zero output value – their minimum output magnitude is half the step size. In contrast, mid-tread quantizers do have a zero output level. For some applications, having a zero output signal representation may be a necessity.
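A mid-riser quantizer can be sketched in the same two-stage form. The function names below are illustrative only.

```python
import math

def mid_riser_index(x: float, step: float) -> int:
    """Classification rule: k = floor(x / step)."""
    return math.floor(x / step)

def mid_riser_value(k: int, step: float) -> float:
    """Reconstruction rule: y_k = step * (k + 1/2)."""
    return step * (k + 0.5)

def quantize_mid_riser(x: float, step: float) -> float:
    return mid_riser_value(mid_riser_index(x, step), step)

# Inputs on either side of zero map to +step/2 or -step/2, never to 0.
print(quantize_mid_riser(0.1, 1.0))   # 0.5
print(quantize_mid_riser(-0.1, 1.0))  # -0.5
print(quantize_mid_riser(1.7, 1.0))   # 1.5
```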

In general, a mid-riser or mid-tread quantizer may not actually be a uniform quantizer – i.e., the size of the quantizer's classification intervals may not all be the same, or the spacing between its possible output values may not all be the same. The distinguishing characteristic of a mid-riser quantizer is that it has a classification threshold value that is exactly zero, and the distinguishing characteristic of a mid-tread quantizer is that it has a reconstruction value that is exactly zero. [9]

A dead-zone quantizer is a type of mid-tread quantizer with symmetric behavior around 0. The region around the zero output value of such a quantizer is referred to as the dead zone or deadband. The dead zone can sometimes serve the same purpose as a noise gate or squelch function. Especially for compression applications, the dead zone may be given a different width than that for the other steps. For an otherwise-uniform quantizer, the dead-zone width can be set to any value ${\displaystyle w}$ by using the forward quantization rule [10] [11] [12]

${\displaystyle k=\operatorname {sgn}(x)\cdot \max \left(0,\left\lfloor {\frac {\left|x\right|-w/2}{\Delta }}+1\right\rfloor \right)}$,

where the function ${\displaystyle \operatorname {sgn} }$( ) is the sign function (also known as the signum function). The general reconstruction rule for such a dead-zone quantizer is given by

${\displaystyle y_{k}=\operatorname {sgn}(k)\cdot \left({\frac {w}{2}}+\Delta \cdot (|k|-1+r_{k})\right)}$,

where ${\displaystyle r_{k}}$ is a reconstruction offset value in the range of 0 to 1 as a fraction of the step size. Ordinarily, ${\displaystyle 0\leq r_{k}\leq {\tfrac {1}{2}}}$ when quantizing input data with a typical probability density function (PDF) that is symmetric around zero and reaches its peak value at zero (such as a Gaussian, Laplacian, or generalized Gaussian PDF). Although ${\displaystyle r_{k}}$ may depend on ${\displaystyle k}$ in general, and can be chosen to fulfill the optimality condition described below, it is often simply set to a constant, such as ${\displaystyle {\tfrac {1}{2}}}$. (Note that in this definition, ${\displaystyle y_{0}=0}$ due to the definition of the ${\displaystyle \operatorname {sgn} }$( ) function, so ${\displaystyle r_{0}}$ has no effect.)

A very commonly used special case (e.g., the scheme typically used in financial accounting and elementary mathematics) is to set ${\displaystyle w=\Delta }$ and ${\displaystyle r_{k}={\tfrac {1}{2}}}$ for all ${\displaystyle k}$. In this case, the dead-zone quantizer is also a uniform quantizer, since the central dead-zone of this quantizer has the same width as all of its other steps, and all of its reconstruction values are equally spaced as well.
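The dead-zone rules above can be sketched as follows. The function names, the parameter names `step`, `width` and `r` (for ${\displaystyle \Delta }$, ${\displaystyle w}$ and a constant ${\displaystyle r_{k}}$), and the sample inputs are assumptions made for illustration.

```python
import math

def sgn(v) -> int:
    """Sign function: -1, 0 or +1."""
    return (v > 0) - (v < 0)

def dead_zone_index(x: float, step: float, width: float) -> int:
    """Forward rule: k = sgn(x) * max(0, floor((|x| - w/2) / step + 1))."""
    return sgn(x) * max(0, math.floor((abs(x) - width / 2) / step + 1))

def dead_zone_value(k: int, step: float, width: float, r: float = 0.5) -> float:
    """Reconstruction rule: y_k = sgn(k) * (w/2 + step * (|k| - 1 + r))."""
    return sgn(k) * (width / 2 + step * (abs(k) - 1 + r))

# Special case w = step, r = 1/2: rounding with ties away from zero.
for x in (0.4, 0.5, 2.5, -2.5):
    k = dead_zone_index(x, 1.0, 1.0)
    print(x, k, dead_zone_value(k, 1.0, 1.0))

# A dead zone twice as wide as the other steps maps 0.9 to zero.
print(dead_zone_index(0.9, 1.0, 2.0))  # 0
print(dead_zone_value(dead_zone_index(1.2, 1.0, 2.0), 1.0, 2.0))  # 1.5
```

In the special case, the inputs 2.5 and −2.5 map to 3 and −3, which matches the familiar "round half away from zero" convention.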

## Noise and error characteristics

A common assumption for the analysis of quantization error is that it affects a signal processing system in a similar manner to that of additive white noise – having negligible correlation with the signal and an approximately flat power spectral density. [2] [6] [13] [14] The additive noise model is commonly used for the analysis of quantization error effects in digital filtering systems, and it can be very useful in such analysis. It has been shown to be a valid model in cases of high resolution quantization (small ${\displaystyle \Delta }$ relative to the signal strength) with smooth PDFs. [2] [15]

Additive noise behavior is not always a valid assumption. Quantization error (for quantizers defined as described here) is deterministically related to the signal and not entirely independent of it. Thus, periodic signals can create periodic quantization noise, and in some cases quantization error can even cause limit cycles to appear in digital signal processing systems. One way to ensure effective independence of the quantization error from the source signal is to perform dithered quantization (sometimes with noise shaping), which involves adding random (or pseudo-random) noise to the signal prior to quantization. [6] [14]

### Quantization error models

In the typical case, the original signal is much larger than one least significant bit (LSB). When this is the case, the quantization error is not significantly correlated with the signal, and has an approximately uniform distribution. When rounding is used to quantize, the quantization error has a mean of zero and the root mean square (RMS) value is the standard deviation of this distribution, given by ${\displaystyle \scriptstyle {\frac {1}{\sqrt {12}}}\mathrm {LSB} \ \approx \ 0.289\,\mathrm {LSB} }$. When truncation is used, the error has a non-zero mean of ${\displaystyle \scriptstyle {\frac {1}{2}}\mathrm {LSB} }$ and the RMS value is ${\displaystyle \scriptstyle {\frac {1}{\sqrt {3}}}\mathrm {LSB} }$. Although rounding yields less RMS error than truncation, the difference is only due to the static (DC) term of ${\displaystyle \scriptstyle {\frac {1}{2}}\mathrm {LSB} }$. The RMS values of the AC error are exactly the same in both cases, so there is no special advantage of rounding over truncation in situations where the DC term of the error can be ignored (such as in AC coupled systems). In either case, the standard deviation, as a percentage of the full signal range, changes by a factor of 2 for each 1-bit change in the number of quantization bits. The potential signal-to-quantization-noise power ratio therefore changes by 4, or ${\displaystyle \scriptstyle 10\cdot \log _{10}(4)}$, approximately 6 dB per bit.
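The truncation figures quoted above follow from splitting the mean squared error into a variance term and a squared-mean term. As a short worked check, assuming the error ${\displaystyle e}$ is uniformly distributed over a range of one LSB in both cases (the symbols ${\displaystyle e}$, ${\displaystyle \mu _{e}}$ and ${\displaystyle \sigma _{e}}$ are introduced here only for this illustration):

${\displaystyle \mathrm {E} [e^{2}]=\sigma _{e}^{2}+\mu _{e}^{2}={\tfrac {1}{12}}+\left({\tfrac {1}{2}}\right)^{2}={\tfrac {1}{3}}\ \mathrm {LSB} ^{2}\quad \Rightarrow \quad e_{\mathrm {RMS} }={\tfrac {1}{\sqrt {3}}}\ \mathrm {LSB} \approx 0.577\,\mathrm {LSB} }$

For rounding, ${\displaystyle \mu _{e}=0}$, so only the ${\displaystyle {\tfrac {1}{12}}}$ term remains and the RMS value is ${\displaystyle {\tfrac {1}{\sqrt {12}}}\approx 0.289}$ LSB.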

At lower amplitudes the quantization error becomes dependent on the input signal, resulting in distortion. This distortion is created after the anti-aliasing filter, and if these distortions are above 1/2 the sample rate they will alias back into the band of interest. In order to make the quantization error independent of the input signal, the signal is dithered by adding noise to the signal. This slightly reduces signal to noise ratio, but can completely eliminate the distortion.

### Quantization noise model

Quantization noise is a model of quantization error introduced by quantization in the ADC. It is a rounding error between the analog input voltage to the ADC and the output digitized value. The noise is non-linear and signal-dependent. It can be modelled in several different ways.

In an ideal ADC, where the quantization error is uniformly distributed between −1/2 LSB and +1/2 LSB, and the signal has a uniform distribution covering all quantization levels, the signal-to-quantization-noise ratio (SQNR) can be calculated from

${\displaystyle \mathrm {SQNR} =20\log _{10}(2^{Q})\approx 6.02\cdot Q\ \mathrm {dB} \,\!}$

where Q is the number of quantization bits.

The most common test signals that satisfy these assumptions are full-amplitude triangle waves and sawtooth waves.

For example, a 16-bit ADC has a maximum signal-to-quantization-noise ratio of 6.02 × 16 = 96.3 dB.

When the input signal is a full-amplitude sine wave the distribution of the signal is no longer uniform, and the corresponding equation is instead

${\displaystyle \mathrm {SQNR} \approx 1.761+6.02\cdot Q\ \mathrm {dB} \,\!}$

Here, the quantization noise is once again assumed to be uniformly distributed, which is the case when the input signal has a high amplitude and a wide frequency spectrum. [16] In this case a 16-bit ADC has a maximum signal-to-noise ratio of 98.09 dB. The 1.761 dB difference in signal-to-noise ratio arises only because the signal is a full-scale sine wave instead of a triangle or sawtooth.

For complex signals in high-resolution ADCs this is an accurate model. For low-resolution ADCs, low-level signals in high-resolution ADCs, and for simple waveforms the quantization noise is not uniformly distributed, making this model inaccurate. [17] In these cases the quantization noise distribution is strongly affected by the exact amplitude of the signal.

The calculations are relative to full-scale input. For smaller signals, the relative quantization distortion can be very large. To circumvent this issue, analog companding can be used, but this can introduce distortion.
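The two SQNR formulas in this section are easy to evaluate for a given word length. In the sketch below the function names are illustrative, and the 1.761 dB sine-wave offset is written as ${\displaystyle 10\log _{10}(3/2)}$, which is where that constant comes from under the uniform-noise assumption.

```python
import math

def sqnr_uniform_db(bits: int) -> float:
    """Ideal SQNR for a full-scale, uniformly distributed signal."""
    return 20 * math.log10(2 ** bits)

def sqnr_sine_db(bits: int) -> float:
    """Ideal SQNR for a full-scale sine wave (adds about 1.761 dB)."""
    return sqnr_uniform_db(bits) + 10 * math.log10(1.5)

for bits in (8, 16, 24):
    print(bits, round(sqnr_uniform_db(bits), 2), round(sqnr_sine_db(bits), 2))
# The 16-bit row reads: 16 96.33 98.09
```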

## Design

### Granular distortion and overload distortion

Often the design of a quantizer involves supporting only a limited range of possible output values and performing clipping to limit the output to this range whenever the input exceeds the supported range. The error introduced by this clipping is referred to as overload distortion. Within the extreme limits of the supported range, the amount of spacing between the selectable output values of a quantizer is referred to as its granularity, and the error introduced by this spacing is referred to as granular distortion. It is common for the design of a quantizer to involve determining the proper balance between granular distortion and overload distortion. For a given supported number of possible output values, reducing the average granular distortion may involve increasing the average overload distortion, and vice versa. A technique for controlling the amplitude of the signal (or, equivalently, the quantization step size ${\displaystyle \Delta }$) to achieve the appropriate balance is the use of automatic gain control (AGC). However, in some quantizer designs, the concepts of granular error and overload error may not apply (e.g., for a quantizer with a limited range of input data or with a countably infinite set of selectable output values). [6]
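To make the distinction concrete, the sketch below clips a mid-tread quantizer to a limited index range and accumulates the squared error separately for inputs inside and outside the supported range. The function names, the index limits `k_min` and `k_max`, and the sample data are all hypothetical.

```python
import math

def quantize_clipped(x, step, k_min, k_max):
    """Mid-tread quantizer whose index is clipped to [k_min, k_max]."""
    k = math.floor(x / step + 0.5)
    k = max(k_min, min(k_max, k))
    return k * step

def distortion_split(samples, step, k_min, k_max):
    """Return (granular, overload) contributions to the mean squared error."""
    lo = (k_min - 0.5) * step   # lower edge of the supported input range
    hi = (k_max + 0.5) * step   # upper edge of the supported input range
    granular = overload = 0.0
    for x in samples:
        err2 = (x - quantize_clipped(x, step, k_min, k_max)) ** 2
        if lo <= x < hi:
            granular += err2    # error caused by the step spacing
        else:
            overload += err2    # error caused by clipping
    n = len(samples)
    return granular / n, overload / n

granular, overload = distortion_split([0.1, 0.4, 5.0], 1.0, -2, 2)
print(granular, overload)  # the single clipped sample dominates the total
```

Increasing `step` while keeping the index range fixed would widen the supported range and reduce the overload term, at the cost of a larger granular term.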

### Rate–distortion quantizer design

A scalar quantizer, which performs a quantization operation, can ordinarily be decomposed into two stages:

Classification
A process that classifies the input signal range into ${\displaystyle M}$ non-overlapping intervals ${\displaystyle \{I_{k}\}_{k=1}^{M}}$, by defining ${\displaystyle M-1}$ decision boundary values ${\displaystyle \{b_{k}\}_{k=1}^{M-1}}$, such that ${\displaystyle I_{k}=[b_{k-1}~,~b_{k})}$ for ${\displaystyle k=1,2,\ldots ,M}$, with the extreme limits defined by ${\displaystyle b_{0}=-\infty }$ and ${\displaystyle b_{M}=\infty }$. All the inputs ${\displaystyle x}$ that fall in a given interval range ${\displaystyle I_{k}}$ are associated with the same quantization index ${\displaystyle k}$.
Reconstruction
Each interval ${\displaystyle I_{k}}$ is represented by a reconstruction value ${\displaystyle y_{k}}$ which implements the mapping ${\displaystyle x\in I_{k}\Rightarrow y=y_{k}}$.

These two stages together comprise the mathematical operation of ${\displaystyle y=Q(x)}$.

Entropy coding techniques can be applied to communicate the quantization indices from a source encoder that performs the classification stage to a decoder that performs the reconstruction stage. One way to do this is to associate each quantization index ${\displaystyle k}$ with a binary codeword ${\displaystyle c_{k}}$. An important consideration is the number of bits used for each codeword, denoted here by ${\displaystyle \mathrm {length} (c_{k})}$. As a result, the design of an ${\displaystyle M}$-level quantizer and an associated set of codewords for communicating its index values requires finding the values of ${\displaystyle \{b_{k}\}_{k=1}^{M-1}}$, ${\displaystyle \{c_{k}\}_{k=1}^{M}}$ and ${\displaystyle \{y_{k}\}_{k=1}^{M}}$ which optimally satisfy a selected set of design constraints such as the bit rate ${\displaystyle R}$ and distortion ${\displaystyle D}$.

Assuming that an information source ${\displaystyle S}$ produces random variables ${\displaystyle X}$ with an associated PDF ${\displaystyle f(x)}$, the probability ${\displaystyle p_{k}}$ that the random variable falls within a particular quantization interval ${\displaystyle I_{k}}$ is given by:

${\displaystyle p_{k}=P[x\in I_{k}]=\int _{b_{k-1}}^{b_{k}}f(x)dx}$.

The resulting bit rate ${\displaystyle R}$, in units of average bits per quantized value, for this quantizer can be derived as follows:

${\displaystyle R=\sum _{k=1}^{M}p_{k}\cdot \mathrm {length} (c_{k})=\sum _{k=1}^{M}\mathrm {length} (c_{k})\int _{b_{k-1}}^{b_{k}}f(x)dx}$.

If it is assumed that distortion is measured by mean squared error, [note 1] the distortion ${\displaystyle D}$ is given by:

${\displaystyle D=E[(x-Q(x))^{2}]=\int _{-\infty }^{\infty }(x-Q(x))^{2}f(x)dx=\sum _{k=1}^{M}\int _{b_{k-1}}^{b_{k}}(x-y_{k})^{2}f(x)dx}$.

A key observation is that rate ${\displaystyle R}$ depends on the decision boundaries ${\displaystyle \{b_{k}\}_{k=1}^{M-1}}$ and the codeword lengths ${\displaystyle \{\mathrm {length} (c_{k})\}_{k=1}^{M}}$, whereas the distortion ${\displaystyle D}$ depends on the decision boundaries ${\displaystyle \{b_{k}\}_{k=1}^{M-1}}$ and the reconstruction levels ${\displaystyle \{y_{k}\}_{k=1}^{M}}$.
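The rate and distortion integrals above can be evaluated numerically for a concrete quantizer. The sketch below is a minimal illustration that assumes a unit-variance Gaussian PDF, a simple midpoint-rule integrator, and a finite cutoff `tail` standing in for the infinite limits ${\displaystyle b_{0}}$ and ${\displaystyle b_{M}}$; all names are hypothetical.

```python
import math

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def integrate(f, a, b, n=2000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def rate_and_distortion(bounds, levels, code_lengths, pdf, tail=8.0):
    """Evaluate R and D for boundaries b_k, levels y_k and codeword lengths."""
    edges = [-tail] + list(bounds) + [tail]   # finite stand-ins for +/- infinity
    rate = distortion = 0.0
    for k, y in enumerate(levels):
        lo, hi = edges[k], edges[k + 1]
        p_k = integrate(pdf, lo, hi)                      # interval probability
        rate += p_k * code_lengths[k]                     # depends on b_k and lengths
        distortion += integrate(lambda x: (x - y) ** 2 * pdf(x), lo, hi)
    return rate, distortion

# A 4-level uniform quantizer with step 1 and a 2-bit fixed-length code.
rate, distortion = rate_and_distortion(
    bounds=[-1.0, 0.0, 1.0],
    levels=[-1.5, -0.5, 0.5, 1.5],
    code_lengths=[2, 2, 2, 2],
    pdf=gaussian_pdf,
)
print(rate, distortion)
```

Changing only `levels` alters the distortion but not the rate, while changing only `code_lengths` alters the rate but not the distortion.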

After defining these two performance metrics for the quantizer, a typical rate–distortion formulation for a quantizer design problem can be expressed in one of two ways:

1. Given a maximum distortion constraint ${\displaystyle D\leq D_{\max }}$, minimize the bit rate ${\displaystyle R}$
2. Given a maximum bit rate constraint ${\displaystyle R\leq R_{\max }}$, minimize the distortion ${\displaystyle D}$

Often the solution to these problems can be equivalently (or approximately) expressed and solved by converting the formulation to the unconstrained problem ${\displaystyle \min \left\{D+\lambda \cdot R\right\}}$ where the Lagrange multiplier ${\displaystyle \lambda }$ is a non-negative constant that establishes the appropriate balance between rate and distortion. Solving the unconstrained problem is equivalent to finding a point on the convex hull of the family of solutions to an equivalent constrained formulation of the problem. However, finding a solution – especially a closed-form solution – to any of these three problem formulations can be difficult. Solutions that do not require multi-dimensional iterative optimization techniques have been published for only three PDFs: the uniform, [18] exponential, [12] and Laplacian [12] distributions. Iterative optimization approaches can be used to find solutions in other cases. [6] [19] [20]

Note that the reconstruction values ${\displaystyle \{y_{k}\}_{k=1}^{M}}$ affect only the distortion – they do not affect the bit rate – and that each individual ${\displaystyle y_{k}}$ makes a separate contribution ${\displaystyle d_{k}}$ to the total distortion as shown below:

${\displaystyle D=\sum _{k=1}^{M}d_{k}}$

where

${\displaystyle d_{k}=\int _{b_{k-1}}^{b_{k}}(x-y_{k})^{2}f(x)dx}$

This observation can be used to ease the analysis – given the set of ${\displaystyle \{b_{k}\}_{k=1}^{M-1}}$ values, the value of each ${\displaystyle y_{k}}$ can be optimized separately to minimize its contribution to the distortion ${\displaystyle D}$.

For the mean-square error distortion criterion, it can be easily shown that the optimal set of reconstruction values ${\displaystyle \{y_{k}^{*}\}_{k=1}^{M}}$ is given by setting the reconstruction value ${\displaystyle y_{k}}$ within each interval ${\displaystyle I_{k}}$ to the conditional expected value (also referred to as the centroid) within the interval, as given by:

${\displaystyle y_{k}^{*}={\frac {1}{p_{k}}}\int _{b_{k-1}}^{b_{k}}xf(x)dx}$.
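The centroid formula can be approximated numerically for any PDF that can be evaluated pointwise. The sketch below uses a midpoint rule; the function name `centroid` and the example PDFs are assumptions for illustration. Because the same weights appear in the numerator and denominator, the PDF does not need to be normalized.

```python
import math

def centroid(pdf, lo, hi, n=2000):
    """Conditional mean of x over [lo, hi] under pdf, by the midpoint rule."""
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        w = pdf(x)
        num += x * w      # approximates the integral of x * f(x)
        den += w          # approximates p_k (up to the common factor h)
    return num / den

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# For a flat PDF the centroid is the interval midpoint.
print(centroid(lambda x: 1.0, 0.0, 1.0))
# For the positive half of a unit Gaussian it is close to sqrt(2/pi).
print(centroid(gaussian_pdf, 0.0, 8.0), math.sqrt(2 / math.pi))
```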

The use of sufficiently well-designed entropy coding techniques can result in the use of a bit rate that is close to the true information content of the indices ${\displaystyle \{k\}_{k=1}^{M}}$, such that effectively

${\displaystyle \mathrm {length} (c_{k})\approx -\log _{2}\left(p_{k}\right)}$

and therefore

${\displaystyle R=\sum _{k=1}^{M}-p_{k}\cdot \log _{2}\left(p_{k}\right)}$.

The use of this approximation can allow the entropy coding design problem to be separated from the design of the quantizer itself. Modern entropy coding techniques such as arithmetic coding can achieve bit rates that are very close to the true entropy of a source, given a set of known (or adaptively estimated) probabilities ${\displaystyle \{p_{k}\}_{k=1}^{M}}$.
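Under this approximation the rate is simply the entropy of the index probabilities. A minimal sketch, with a hypothetical function name and example probabilities:

```python
import math

def entropy_bits(probs):
    """Entropy in bits: the sum of -p_k * log2(p_k) over nonzero p_k."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]
ideal_lengths = [-math.log2(p) for p in probs]

print(ideal_lengths)           # [1.0, 2.0, 3.0, 3.0]
print(entropy_bits(probs))     # 1.75
print(entropy_bits([0.25] * 4))  # 2.0, the same as a 2-bit fixed-length code
```

With these skewed probabilities an ideal entropy coder needs 1.75 bits per index on average, compared with 2 bits for a fixed-length code over the same four indices.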

In some designs, rather than optimizing for a particular number of classification regions ${\displaystyle M}$, the quantizer design problem may include optimization of the value of ${\displaystyle M}$ as well. For some probabilistic source models, the best performance may be achieved when ${\displaystyle M}$ approaches infinity.

### Neglecting the entropy constraint: Lloyd–Max quantization

In the above formulation, if the bit rate constraint is neglected by setting ${\displaystyle \lambda }$ equal to 0, or equivalently if it is assumed that a fixed-length code (FLC) will be used to represent the quantized data instead of a variable-length code (or some other entropy coding technology such as arithmetic coding that is better than an FLC in the rate–distortion sense), the optimization problem reduces to minimization of distortion ${\displaystyle D}$ alone.

The indices produced by an ${\displaystyle M}$-level quantizer can be coded using a fixed-length code using ${\displaystyle R=\lceil \log _{2}M\rceil }$ bits/symbol. For example, when ${\displaystyle M=}$256 levels, the FLC bit rate ${\displaystyle R}$ is 8 bits/symbol. For this reason, such a quantizer has sometimes been called an 8-bit quantizer. However, using an FLC eliminates the compression improvement that can be obtained by use of better entropy coding.

Assuming an FLC with ${\displaystyle M}$ levels, the rate–distortion minimization problem can be reduced to distortion minimization alone. The reduced problem can be stated as follows: given a source ${\displaystyle X}$ with PDF ${\displaystyle f(x)}$ and the constraint that the quantizer must use only ${\displaystyle M}$ classification regions, find the decision boundaries ${\displaystyle \{b_{k}\}_{k=1}^{M-1}}$ and reconstruction levels ${\displaystyle \{y_{k}\}_{k=1}^{M}}$ to minimize the resulting distortion

${\displaystyle D=E[(x-Q(x))^{2}]=\int _{-\infty }^{\infty }(x-Q(x))^{2}f(x)dx=\sum _{k=1}^{M}\int _{b_{k-1}}^{b_{k}}(x-y_{k})^{2}f(x)dx=\sum _{k=1}^{M}d_{k}}$.

Finding an optimal solution to the above problem results in a quantizer sometimes called an MMSQE (minimum mean-square quantization error) solution, and the resulting PDF-optimized (non-uniform) quantizer is referred to as a Lloyd–Max quantizer, named after two people who independently developed iterative methods [6] [21] [22] to solve the two sets of simultaneous equations resulting from ${\displaystyle {\partial D/\partial b_{k}}=0}$ and ${\displaystyle {\partial D/\partial y_{k}}=0}$, as follows:

${\displaystyle {\partial D \over \partial b_{k}}=0\Rightarrow b_{k}={y_{k}+y_{k+1} \over 2}}$,

which places each threshold at the midpoint between each pair of reconstruction values, and

${\displaystyle {\partial D \over \partial y_{k}}=0\Rightarrow y_{k}={\int _{b_{k-1}}^{b_{k}}xf(x)dx \over \int _{b_{k-1}}^{b_{k}}f(x)dx}={\frac {1}{p_{k}}}\int _{b_{k-1}}^{b_{k}}xf(x)dx}$

which places each reconstruction value at the centroid (conditional expected value) of its associated classification interval.

Lloyd's Method I algorithm, originally described in 1957, can be generalized in a straightforward way for application to vector data. This generalization results in the Linde–Buzo–Gray (LBG) or k-means classifier optimization methods. Moreover, the technique can be further generalized in a straightforward way to also include an entropy constraint for vector data. [23]
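The two optimality conditions suggest a simple alternating iteration: place the boundaries at the midpoints of the current levels, then move each level to the centroid of its interval. The sketch below follows that idea for a scalar source. It is written in the spirit of Lloyd's Method I rather than as a reproduction of the published algorithm, and it assumes a unit-variance Gaussian PDF, midpoint-rule integration, a finite cutoff `tail` in place of the infinite outer limits, and a fixed iteration count; all names are hypothetical.

```python
import math

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def centroid(pdf, lo, hi, n=400):
    """Conditional mean of x over [lo, hi] under pdf, by the midpoint rule."""
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        w = pdf(x)
        num += x * w
        den += w
    return num / den

def midpoints(levels):
    return [(a + b) / 2 for a, b in zip(levels, levels[1:])]

def lloyd_max(pdf, levels, tail=8.0, iterations=60):
    """Alternate the midpoint and centroid conditions for a fixed count."""
    levels = sorted(levels)
    for _ in range(iterations):
        edges = [-tail] + midpoints(levels) + [tail]   # boundary condition
        levels = [centroid(pdf, lo, hi)                # centroid condition
                  for lo, hi in zip(edges, edges[1:])]
    return midpoints(levels), levels

bounds, levels = lloyd_max(gaussian_pdf, [-1.5, -0.5, 0.5, 1.5])
print([round(b, 3) for b in bounds])
print([round(y, 3) for y in levels])
# For this 4-level Gaussian case the levels settle near +/-0.453 and +/-1.510.
```

A practical implementation would normally stop when the distortion stops improving instead of running a fixed number of iterations, and would guard against intervals with negligible probability.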

### Uniform quantization and the 6 dB/bit approximation

The Lloyd–Max quantizer is actually a uniform quantizer when the input PDF is uniformly distributed over the range ${\displaystyle [y_{1}-\Delta /2,~y_{M}+\Delta /2)}$. However, for a source that does not have a uniform distribution, the minimum-distortion quantizer may not be a uniform quantizer. The analysis of a uniform quantizer applied to a uniformly distributed source can be summarized in what follows:

A symmetric source X can be modelled with ${\displaystyle f(x)={\tfrac {1}{2X_{\max }}}}$, for ${\displaystyle x\in [-X_{\max },X_{\max }]}$ and 0 elsewhere. The step size is ${\displaystyle \Delta ={\tfrac {2X_{\max }}{M}}}$, and the signal to quantization noise ratio (SQNR) of the quantizer is

${\displaystyle {\rm {SQNR}}=10\log _{10}{\frac {\sigma _{x}^{2}}{\sigma _{q}^{2}}}=10\log _{10}{\frac {(M\Delta )^{2}/12}{\Delta ^{2}/12}}=10\log _{10}M^{2}=20\log _{10}M}$.

For a fixed-length code using ${\displaystyle N}$ bits, ${\displaystyle M=2^{N}}$, resulting in ${\displaystyle {\rm {SQNR}}=20\log _{10}{2^{N}}=N\cdot (20\log _{10}2)=N\cdot 6.0206\,{\rm {dB}}}$,

or approximately 6 dB per bit. For example, for ${\displaystyle N}$=8 bits, ${\displaystyle M}$=256 levels and SQNR = 8×6 = 48 dB; and for ${\displaystyle N}$=16 bits, ${\displaystyle M}$=65536 and SQNR = 16×6 = 96 dB. The property of 6 dB improvement in SQNR for each extra bit used in quantization is a well-known figure of merit. However, it must be used with care: this derivation is only for a uniform quantizer applied to a uniform source. For other source PDFs and other quantizer designs, the SQNR may be somewhat different from that predicted by 6 dB/bit, depending on the type of PDF, the type of source, the type of quantizer, and the bit rate range of operation.
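The uniform-source result can be checked by simulation. The sketch below draws uniformly distributed samples, quantizes them with an ${\displaystyle M=2^{N}}$ level uniform quantizer spanning the source range, and compares the measured SQNR with ${\displaystyle 20\log _{10}M}$. The function name, sample count, seed, and the choice of a mid-riser layout are assumptions made for the example.

```python
import math
import random

def measured_sqnr_db(bits, x_max=1.0, n=100_000, seed=1):
    """Simulated SQNR of a 2**bits-level uniform quantizer on a uniform source."""
    levels = 2 ** bits
    step = 2 * x_max / levels
    rng = random.Random(seed)
    signal = noise = 0.0
    for _ in range(n):
        x = rng.uniform(-x_max, x_max)
        # Guard the top edge so the index stays within the M available levels.
        k = min(levels // 2 - 1, math.floor(x / step))
        y = step * (k + 0.5)          # reconstruction at the interval centre
        signal += x * x
        noise += (x - y) ** 2
    return 10 * math.log10(signal / noise)

for bits in (4, 8):
    predicted = 20 * math.log10(2 ** bits)
    print(bits, round(predicted, 2), round(measured_sqnr_db(bits), 2))
```

The measured values should land close to 24.08 dB and 48.16 dB, differing only by sampling noise.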

However, it is common to assume that for many sources, the slope of a quantizer SQNR function can be approximated as 6 dB/bit when operating at a sufficiently high bit rate. At asymptotically high bit rates, cutting the step size in half increases the bit rate by approximately 1 bit per sample (because 1 bit is needed to indicate whether the value is in the left or right half of the prior double-sized interval) and reduces the mean squared error by a factor of 4 (i.e., 6 dB) based on the ${\displaystyle \Delta ^{2}/12}$ approximation.

At asymptotically high bit rates, the 6 dB/bit approximation is supported for many source PDFs by rigorous theoretical analysis. [2] [3] [5] [6] Moreover, the structure of the optimal scalar quantizer (in the rate–distortion sense) approaches that of a uniform quantizer under these conditions. [5] [6]

## In other fields

Many physical quantities are actually quantized by physical entities. Examples of fields where this limitation applies include electronics (due to electrons), optics (due to photons), biology (due to DNA), physics (due to Planck limits) and chemistry (due to molecules).

## Notes

1. Other distortion measures can also be considered, although mean squared error is a popular one.

## Related Research Articles

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , with possible outcomes , which occur with probability the entropy of is formally defined as:

In electronics, an analog-to-digital converter is a system that converts an analog signal, such as a sound picked up by a microphone or light entering a digital camera, into a digital signal. An ADC may also provide an isolated measurement such as an electronic device that converts an analog input voltage or current to a digital number representing the magnitude of the voltage or current. Typically the digital output is a two's complement binary number that is proportional to the input, but there are other possibilities.

A delta modulation is an analog-to-digital and digital-to-analog signal conversion technique used for transmission of voice information where quality is not of primary importance. DM is the simplest form of differential pulse-code modulation (DPCM) where the difference between successive samples is encoded into n-bit data streams. In delta modulation, the transmitted data are reduced to a 1-bit data stream. Its main features are:

In signal processing, a digital filter is a system that performs mathematical operations on a sampled, discrete-time signal to reduce or enhance certain aspects of that signal. This is in contrast to the other major type of electronic filter, the analog filter, which is typically an electronic circuit operating on continuous-time analog signals.

Signal-to-noise ratio is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. SNR is defined as the ratio of signal power to the noise power, often expressed in decibels. A ratio higher than 1:1 indicates more signal than noise.

Rate–distortion theory is a major branch of information theory which provides the theoretical foundations for lossy data compression; it addresses the problem of determining the minimal number of bits per symbol, as measured by the rate R, that should be communicated over a channel, so that the source can be approximately reconstructed at the receiver without exceeding an expected distortion D.

In signal processing, sampling is the reduction of a continuous-time signal to a discrete-time signal. A common example is the conversion of a sound wave to a sequence of "samples". A sample is a value of the signal at a point in time and/or space; this definition differs from the usage in statistics, which refers to a set of such values.

A numerically-controlled oscillator (NCO) is a digital signal generator which creates a synchronous, discrete-time, discrete-valued representation of a waveform, usually sinusoidal. NCOs are often used in conjunction with a digital-to-analog converter (DAC) at the output to create a direct digital synthesizer (DDS).
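
One common way to build an NCO is a phase accumulator that indexes a sine lookup table. The sketch below assumes that structure; the function name `nco`, the 32-bit accumulator, and the 256-entry table are illustrative choices rather than properties of any particular device.

```python
import math


def nco(freq_hz, sample_rate_hz, n_samples, acc_bits=32, table_bits=8):
    """Generate sine samples with a phase accumulator and a lookup table."""
    table_size = 1 << table_bits
    table = [math.sin(2.0 * math.pi * i / table_size) for i in range(table_size)]
    # The tuning word is the phase increment per sample, in accumulator units.
    tuning_word = round(freq_hz / sample_rate_hz * (1 << acc_bits))
    mask = (1 << acc_bits) - 1
    phase = 0
    out = []
    for _ in range(n_samples):
        # The top bits of the accumulator select the table entry.
        out.append(table[phase >> (acc_bits - table_bits)])
        phase = (phase + tuning_word) & mask  # wrap around like a hardware register
    return out


# 1 kHz at an 8 kHz sample rate gives exactly 8 samples per cycle.
samples = nco(1000.0, 8000.0, 8)
```

Both the phase and the output amplitude are discrete here: the table index quantizes the phase, and in hardware the table entries themselves would be stored with finite word length.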

Noise shaping is a technique typically used in digital audio, image, and video processing, usually in combination with dithering, as part of the process of quantization or bit-depth reduction of a digital signal. Its purpose is to increase the apparent signal-to-noise ratio of the resultant signal. It does this by altering the spectral shape of the error that is introduced by dithering and quantization, so that the noise power is at a lower level in frequency bands where noise is considered to be less desirable and at a correspondingly higher level in bands where it is considered to be more desirable. A popular noise shaping algorithm used in image processing is known as Floyd–Steinberg dithering, and many noise shaping algorithms used in audio processing are based on an absolute threshold of hearing model.
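
A minimal sketch of the idea is first-order error feedback: the quantization error of each sample is subtracted from the next input before it is quantized, which pushes the error toward high frequencies. The function names, the omission of dither, and the choice of a first-order loop are assumptions made for this example; the quantizer is the mid-tread uniform quantizer with step size Δ (`step`).

```python
import math


def quantize(x, step):
    """Mid-tread uniform quantizer: step * floor(x / step + 1/2)."""
    return step * math.floor(x / step + 0.5)


def requantize_noise_shaped(samples, step):
    """First-order error-feedback requantizer (no dither in this sketch)."""
    out = []
    error = 0.0  # quantization error of the previous sample
    for x in samples:
        target = x - error        # feed the previous error back
        y = quantize(target, step)
        error = y - target        # error made on this sample
        out.append(y)
    return out


signal = [0.3] * 100
plain = [quantize(x, 1.0) for x in signal]
shaped = requantize_noise_shaped(signal, 1.0)
print(sum(plain) / len(plain))    # plain rounding loses the 0.3 level entirely
print(sum(shaped) / len(shaped))  # the shaped output averages close to 0.3
```

Because each output equals the input plus the difference of two successive errors, the errors cancel in the running sum, so the low-frequency content of the signal is preserved even though every output value is a multiple of the step size.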

Phase correlation is an approach to estimate the relative translative offset between two similar images or other data sets. It is commonly used in image registration and relies on a frequency-domain representation of the data, usually calculated by fast Fourier transforms. The term is applied particularly to a subset of cross-correlation techniques that isolate the phase information from the Fourier-space representation of the cross-correlogram.

Delta-sigma modulation is a method for encoding analog signals into digital signals as found in an analog-to-digital converter (ADC). It is also used to convert high bit-count, low-frequency digital signals into lower bit-count, higher-frequency digital signals as part of the process to convert digital signals into analog as part of a digital-to-analog converter (DAC).

In general, noise is uncontrolled random variation from an expected value and is typically unwanted. General causes include thermal fluctuations, mechanical vibrations, industrial noise, fluctuations of voltage from a power supply, thermal noise due to Brownian motion, instrumentation noise, and a laser's output mode deviating from the desired mode of operation. Quantum noise is observed in a system where conventional sources of noise are suppressed. The dominant noise is then due to the discrete nature of the very small; it arises from the uncertainty principle and from zero-point energy fluctuations. Quantified noise is similar to classical noise theory and will not always return an asymmetric spectral density.

Signal-to-quantization-noise ratio (SQNR) is a widely used quality measure in analysing digitizing schemes such as pulse-code modulation (PCM). The SQNR reflects the relationship between the maximum nominal signal strength and the quantization error introduced in the analog-to-digital conversion.

In digital audio using pulse-code modulation (PCM), bit depth is the number of bits of information in each sample, and it directly corresponds to the resolution of each sample. Examples of bit depth include Compact Disc Digital Audio, which uses 16 bits per sample, and DVD-Audio and Blu-ray Disc which can support up to 24 bits per sample.

Effective number of bits (ENOB) is a measure of the dynamic range of an analog-to-digital converter (ADC), digital-to-analog converter, or their associated circuitry. The resolution of an ADC is specified by the number of bits used to represent the analog value. Ideally, a 12-bit ADC will have an effective number of bits of almost 12. However, real signals have noise, and real circuits are imperfect and introduce additional noise and distortion. Those imperfections reduce the number of bits of accuracy in the ADC. The ENOB describes the effective resolution of the system in bits. An ADC may have a 12-bit resolution but the effective number of bits, when used in a system, may be 9.5.
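
The arithmetic behind such figures can be sketched as follows. This example assumes the widely quoted relation for an ideal converter driven by a full-scale sine wave, in which the signal-to-quantization-noise ratio is about 6.02·N + 1.76 dB for N bits; the relation, the function names, and the use of a measured signal-to-noise-and-distortion figure (`sinad_db`) as input are assumptions brought in for illustration.

```python
def ideal_sqnr_db(bits):
    """Ideal SQNR of an N-bit uniform quantizer for a full-scale sine wave."""
    return 6.02 * bits + 1.76


def enob(sinad_db):
    """Effective number of bits implied by a measured SINAD value in dB."""
    return (sinad_db - 1.76) / 6.02


print(ideal_sqnr_db(12))  # an ideal 12-bit converter, about 74 dB
print(enob(58.95))        # a measured 58.95 dB corresponds to about 9.5 bits
```

Inverting the ideal relation in this way is what lets a nominally 12-bit converter be described as having, for example, 9.5 effective bits.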

Precoding is a generalization of beamforming to support multi-stream transmission in multi-antenna wireless communications. In conventional single-stream beamforming, the same signal is emitted from each of the transmit antennas with appropriate weighting such that the signal power is maximized at the receiver output. When the receiver has multiple antennas, single-stream beamforming cannot simultaneously maximize the signal level at all of the receive antennas. In order to maximize the throughput in multiple receive antenna systems, multi-stream transmission is generally required.

In mathematics, Pythagorean addition is a binary operation on the real numbers that computes the length of the hypotenuse of a right triangle, given its two sides. According to the Pythagorean theorem, for a triangle with sides ${\displaystyle a}$ and ${\displaystyle b}$, this length can be calculated as ${\displaystyle a\oplus b={\sqrt {a^{2}+b^{2}}}}$.
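
As a minimal sketch, the operation can be written with `math.hypot` from Python's standard library, which computes the square root of the sum of squares. The wrapper name `pythagorean_add` is an assumption made for this example.

```python
import math


def pythagorean_add(a, b):
    """Hypotenuse length of a right triangle with sides a and b."""
    return math.hypot(a, b)


print(pythagorean_add(3.0, 4.0))  # 5.0

# Applying the operation repeatedly combines more than two terms.
total = pythagorean_add(pythagorean_add(1.0, 2.0), 2.0)
print(total)  # close to 3.0, the square root of 1 + 4 + 4
```

The same operation is commonly used to combine the root-mean-square levels of uncorrelated noise sources.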

A randomness extractor, often simply called an "extractor", is a function which, when applied to the output of a weakly random entropy source together with a short, uniformly random seed, generates a highly random output that appears independent of the source and uniformly distributed. Examples of weakly random sources include radioactive decay or thermal noise; the only restriction on possible sources is that there is no way they can be fully controlled, calculated or predicted, and that a lower bound on their entropy rate can be established. For a given source, a randomness extractor can even be considered to be a true random number generator (TRNG), but there is no single extractor that has been proven to produce truly random output from any type of weakly random source.

Pulse-density modulation, or PDM, is a form of modulation used to represent an analog signal with a binary signal. In a PDM signal, specific amplitude values are not encoded into codewords of pulses of different weight as they would be in pulse-code modulation (PCM); rather, the relative density of the pulses corresponds to the analog signal's amplitude. The output of a 1-bit DAC is the same as the PDM encoding of the signal. Pulse-width modulation (PWM) is a special case of PDM where the switching frequency is fixed and all the pulses corresponding to one sample are contiguous in the digital signal. For a 50% voltage with a resolution of 8 bits, a PWM waveform will turn on for 128 clock cycles and then off for the remaining 128 cycles. With PDM and the same clock rate, the signal would alternate between on and off every other cycle. The average is 50% for both waveforms, but the PDM signal switches more often. For a 100% or 0% level, the two are the same.
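
The 8-bit, 50% comparison can be reproduced with a short sketch. The function names are assumptions, and the PDM stream is generated here with a simple overflowing accumulator (a first-order delta-sigma style scheme), which is one possible way to produce it.

```python
def pwm_frame(level, period):
    """One PWM frame: `level` on-cycles followed by the off-cycles."""
    return [1] * level + [0] * (period - level)


def pdm_frame(level, period):
    """PDM bits from an accumulator that emits 1 whenever it overflows."""
    bits = []
    acc = 0
    for _ in range(period):
        acc += level
        if acc >= period:
            acc -= period
            bits.append(1)
        else:
            bits.append(0)
    return bits


def transitions(bits):
    """Number of on/off changes within the frame."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


pwm = pwm_frame(128, 256)
pdm = pdm_frame(128, 256)
print(sum(pwm), sum(pdm))                  # 128 128: both average 50%
print(transitions(pwm), transitions(pdm))  # 1 255: PDM switches far more often
```

With a level of 0 or 256 the two functions return identical frames, all zeros or all ones.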

In multidimensional signal processing, multidimensional signal restoration refers to the problem of estimating the original input signal from observations of a distorted or noise-contaminated version of that signal, using some prior information about the input signal and/or the distortion process. Multidimensional signal processing systems such as audio, image and video processing systems often receive as input signals that have undergone distortions such as blurring or band-limiting during signal acquisition or transmission, and it may be vital to recover the original signal for further filtering. Multidimensional signal restoration is an inverse problem, where only the distorted signal is observed and some information about the distortion process and/or input signal properties is known. A general class of iterative methods has been developed for the multidimensional restoration problem, with successful applications to multidimensional deconvolution, signal extrapolation and denoising.

## References

1. Sheppard, W. F. (1897). "On the Calculation of the most Probable Values of Frequency-Constants, for Data arranged according to Equidistant Division of a Scale". Proceedings of the London Mathematical Society. Wiley. s1-29 (1): 353–380. doi:10.1112/plms/s1-29.1.353. ISSN 0024-6115.
2. W. R. Bennett, "Spectra of Quantized Signals", Bell System Technical Journal, Vol. 27, pp. 446–472, July 1948.
3. Oliver, B.M.; Pierce, J.R.; Shannon, C.E. (1948). "The Philosophy of PCM". Proceedings of the IRE. Institute of Electrical and Electronics Engineers (IEEE). 36 (11): 1324–1331. doi:10.1109/jrproc.1948.231941. ISSN 0096-8390.
4. Seymour Stein and J. Jay Jones, Modern Communication Principles, McGraw–Hill, ISBN 978-0-07-061003-3, 1967 (p. 196).
5. Gish, H.; Pierce, J. (1968). "Asymptotically efficient quantizing". IEEE Transactions on Information Theory. Institute of Electrical and Electronics Engineers (IEEE). 14 (5): 676–683. doi:10.1109/tit.1968.1054193. ISSN 0018-9448.
6. Gray, R.M.; Neuhoff, D.L. (1998). "Quantization". IEEE Transactions on Information Theory. Institute of Electrical and Electronics Engineers (IEEE). 44 (6): 2325–2383. doi:10.1109/18.720541. ISSN 0018-9448.
7. Hodgson, Jay (2010). Understanding Records, p. 56. ISBN 978-1-4411-5607-5. Adapted from Franz, David (2004). Recording and Producing in the Home Studio, pp. 38–39. Berklee Press.
8. Gersho, A. (1977). "Quantization". IEEE Communications Society Magazine. Institute of Electrical and Electronics Engineers (IEEE). 15 (5): 16–28. doi:10.1109/mcom.1977.1089500. ISSN 0148-9615.
9. Rabbani, Majid; Joshi, Rajan L.; Jones, Paul W. (2009). "Section 1.2.3: Quantization, in Chapter 1: JPEG 2000 Core Coding System (Part 1)". In Schelkens, Peter; Skodras, Athanassios; Ebrahimi, Touradj (eds.). . John Wiley & Sons. pp. 22–24. ISBN 978-0-470-72147-6.
10. Taubman, David S.; Marcellin, Michael W. (2002). "Chapter 3: Quantization". . Kluwer Academic Publishers. p. 107. ISBN 0-7923-7519-X.
11. Sullivan, G.J. (1996). "Efficient scalar quantization of exponential and Laplacian random variables". IEEE Transactions on Information Theory. Institute of Electrical and Electronics Engineers (IEEE). 42 (5): 1365–1374. doi:10.1109/18.532878. ISSN 0018-9448.
12. Widrow, B. (1956). "A Study of Rough Amplitude Quantization by Means of Nyquist Sampling Theory". IRE Transactions on Circuit Theory. Institute of Electrical and Electronics Engineers (IEEE). 3 (4): 266–276. doi:10.1109/tct.1956.1086334. ISSN 0096-2007.
13. Bernard Widrow, "Statistical analysis of amplitude quantized sampled data systems", Trans. AIEE Pt. II: Appl. Ind., Vol. 79, pp. 555–568, Jan. 1961.
14. Marco, D.; Neuhoff, D.L. (2005). "The Validity of the Additive Noise Model for Uniform Scalar Quantizers". IEEE Transactions on Information Theory. Institute of Electrical and Electronics Engineers (IEEE). 51 (5): 1739–1755. doi:10.1109/tit.2005.846397. ISSN 0018-9448.
15. Pohlman, Ken C. (1989). Principles of Digital Audio 2nd Edition. SAMS. p. 60. ISBN 9780071441568.
16. Watkinson, John (2001). The Art of Digital Audio 3rd Edition. Focal Press. ISBN 0-240-51587-0.
17. Farvardin, N.; Modestino, J. (1984). "Optimum quantizer performance for a class of non-Gaussian memoryless sources". IEEE Transactions on Information Theory. Institute of Electrical and Electronics Engineers (IEEE). 30 (3): 485–497. doi:10.1109/tit.1984.1056920. ISSN 0018-9448. (Section VI.C and Appendix B)
18. Berger, T. (1972). "Optimum quantizers and permutation codes". IEEE Transactions on Information Theory. Institute of Electrical and Electronics Engineers (IEEE). 18 (6): 759–765. doi:10.1109/tit.1972.1054906. ISSN 0018-9448.
19. Berger, T. (1982). "Minimum entropy quantizers and permutation codes". IEEE Transactions on Information Theory. Institute of Electrical and Electronics Engineers (IEEE). 28 (2): 149–157. doi:10.1109/tit.1982.1056456. ISSN 0018-9448.
20. Lloyd, S. (1982). "Least squares quantization in PCM". IEEE Transactions on Information Theory. Institute of Electrical and Electronics Engineers (IEEE). 28 (2): 129–137. doi:10.1109/tit.1982.1056489. ISSN 0018-9448. (Work documented in a manuscript circulated for comments at Bell Laboratories with a department log date of 31 July 1957 and also presented at the 1957 meeting of the Institute of Mathematical Statistics, although not formally published until 1982.)
21. Max, J. (1960). "Quantizing for minimum distortion". IEEE Transactions on Information Theory. Institute of Electrical and Electronics Engineers (IEEE). 6 (1): 7–12. doi:10.1109/tit.1960.1057548. ISSN 0018-9448.
22. Chou, P.A.; Lookabaugh, T.; Gray, R.M. (1989). "Entropy-constrained vector quantization". IEEE Transactions on Acoustics, Speech, and Signal Processing. Institute of Electrical and Electronics Engineers (IEEE). 37 (1): 31–42. doi:10.1109/29.17498. ISSN 0096-3518.