Pyramid vector quantization

Last updated August 10, 2023

Pyramid vector quantization (PVQ) is a method used in audio and video codecs to quantize and transmit unit vectors, i.e. vectors whose magnitudes are known to the decoder but whose directions are unknown. PVQ may also be used as part of a gain/shape quantization scheme, whereby the magnitude and direction of a vector are quantized separately from each other. PVQ was initially described in 1986 in the paper "A Pyramid Vector Quantizer" by Thomas R. Fischer.^[1]

One caveat of PVQ is that it operates under the taxicab distance (L1-norm). Conversion to/from the more familiar Euclidean distance (L2-norm) is possible via vector projection, though results in a less uniform distribution of quantization points (the poles of the Euclidean n-sphere become denser than non-poles).^[3] No efficient algorithm for the ideal (i.e., uniform) vector quantization of the Euclidean n-sphere is known as of 2010.^[4] This non-uniformity can be reduced by applying deformation like coordinate-wise power before projection, reducing mean-squared quantization error by ~10%.^[2]

PVQ is used in the CELT audio codec (now Opus) and the Daala video codec.

Overview

As a form of vector quantization, PVQ defines a codebook of M quantization points, each of which is assigned an integer codeword from 0 to M−1. The goal of the encoder is to find the codeword of the closest vector, which the decoder must decode back into a vector.

The PVQ codebook consists of all N-dimensional points ${\vec {p}}$ with integer-only coordinates whose absolute values sum to a constant K (i.e. whose L1-norm equals K). In set-builder notation:

S(N,K)=\left\{{\vec {p}}\in \mathbb {Z} ^{N}:\left\|{\vec {p}}\right\|_{1}=K\right\}

where $\left\|{\vec {p}}\right\|_{1}$ denotes the L1-norm of ${\vec {p}}$ .

As it stands, the set S tesselates the surface of an N-dimensional pyramid. If desired, we may reshape it into a sphere by "projecting" the points onto the sphere, i.e. by normalizing them:

S_{\text{sphere}}(N,K)=\left\{{\frac {\vec {p}}{\left\|{\vec {p}}\right\|_{2}}}:{\vec {p}}\in S(N,K)\right\}

where $\left\|{\vec {p}}\right\|_{2}$ denotes the L2-norm of ${\vec {p}}$ .

Increasing the parameter K results in more quantization points, and hence typically yields a more "accurate" approximation of the original unit vector ${\vec {v}}$ at the cost of larger integer codewords that require more bits to transmit.

Example

Suppose we wish to quantize three-dimensional unit vectors using the parameter K=2. Our codebook becomes:

Codeword	Point	Normalized point
0	<−2, 0, 0>	<−1.000, 0.000, 0.000>
1	<−1, −1, 0>	<−0.707, −0.707, 0.000>
2	<−1, 0, −1>	<−0.707, 0.000, −0.707>
3	<−1, 0, 1>	<−0.707, 0.000, 0.707>
4	<−1, 1, 0>	<−0.707, 0.707, 0.000>
5	<0, −2, 0>	<0.000, −1.000, 0.000>
6	<0, −1, −1>	<0.000, −0.707, −0.707>
7	<0, −1, 1>	<0.000, −0.707, 0.707>
8	<0, 0, −2>	<0.000, 0.000, −1.000>
9	<0, 0, 2>	<0.000, 0.000, 1.000>

Codeword	Point	Normalized point
10	<0, 1, −1>	<0.000, 0.707, −0.707>
11	<0, 1, 1>	<0.000, 0.707, 0.707>
12	<0, 2, 0>	<0.000, 1.000, 0.000>
13	<1, −1, 0>	<0.707, −0.707, 0.000>
14	<1, 0, −1>	<0.707, 0.000, −0.707>
15	<1, 0, 1>	<0.707, 0.000, 0.707>
16	<1, 1, 0>	<0.707, 0.707, 0.000>
17	<2, 0, 0>	<1.000, 0.000, 0.000>

(0.707 = ${\sqrt {2}}/2$ rounded to 3 decimal places.)

Now, suppose we wish to transmit the unit vector <0.592, −0.720, 0.362> (rounded here to 3 decimal places, for clarity). According to our codebook, the closest point we can pick is codeword 13 (<0.707, −0.707, 0.000>), located approximately 0.381 units away from our original point.

Increasing the parameter K results in a larger codebook, which typically increases the reconstruction accuracy. For example, based on the Python code below, K=5 (codebook size: 102) yields an error of only 0.097 units, and K=20 (codebook size: 1602) yields an error of only 0.042 units.

Python code

importitertoolsimportmathfromtypingimportList,NamedTuple,TupleclassPVQEntry(NamedTuple):codeword:intpoint:Tuple[int,...]normalizedPoint:Tuple[float,...]defcreate_pvq_codebook(n:int,k:int)->List[PVQEntry]:"""    Naive algorithm to generate an n-dimensional PVQ codebook    with k pulses.    Runtime complexity: O(k**n)    """ret=[]forpinitertools.product(range(-k,k+1),repeat=n):ifsum(abs(x)forxinp)==k:norm=math.sqrt(sum(abs(x)**2forxinp))q=tuple(x/normforxinp)ret.append(PVQEntry(len(ret),p,q))returnretdefsearch_pvq_codebook(codebook:List[PVQEntry],p:Tuple[float,...])->Tuple[PVQEntry,float]:"""    Naive algorithm to search the PVQ codebook. Returns the point in the    codebook that's "closest" to p, according to the Euclidean distance.)    """ret=Nonemin_dist=Noneforiinrange(len(codebook)):q=codebook[i].normalizedPointdist=math.sqrt(sum(abs(q[j]-p[j])**2forjinrange(len(p))))ifmin_distisNoneordist<min_dist:ret=codebook[i]min_dist=distreturnret,min_distdefexample(p:Tuple[float,...],k:int)->None:n=len(p)codebook=create_pvq_codebook(n,k)print("Number of codebook entries: "+str(len(codebook)))entry,dist=search_pvq_codebook(codebook,p)print("Best entry: "+str(entry))print("Distance: "+str(dist))phi=1.2theta=5.4x=math.sin(phi)*math.cos(theta)y=math.sin(phi)*math.sin(theta)z=math.cos(phi)p=(x,y,z)example(p,2)example(p,5)example(p,20)

Complexity

The PVQ codebook can be searched in $O(KN)$ .^[4] Encoding and decoding can likewise be performed in $O(KN)$ using $O(K+N)$ memory.^[5]

The codebook size obeys the recurrence^[4]

V(N,K)=V(N-1,K)+V(N,K-1)+V(N-1,K-1)

with $V(N,0)=1$ for all $N\geq 0$ and $V(0,K)=0$ for all $K\neq 0$ .

A closed-form solution is given by^[6]

V(N,K)=2N\cdot {}_{2}F_{1}(1-K,1-N;2;2).

where ${}_{2}F_{1}$ is the hypergeometric function.

Related Research Articles

In vector calculus and differential geometry the generalized Stokes theorem, also called the Stokes–Cartan theorem, is a statement about the integration of differential forms on manifolds, which both simplifies and generalizes several theorems from vector calculus. In particular, the fundamental theorem of calculus is the special case where the manifold is a line segment, Green’s theorem and Stokes' theorem are the cases of a surface in $or and the divergence theorem is the case of a volume in Hence, the theorem is sometimes referred to as the Fundamental Theorem of Multivariate Calculus .$

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences. It is named after the American mathematician Richard Hamming.

Vector quantization (VQ) is a classical quantization technique from signal processing that allows the modeling of probability density functions by the distribution of prototype vectors. It was originally used for data compression. It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms.

In mathematics, the quaternion number system extends the complex numbers. Quaternions were first described by the Irish mathematician William Rowan Hamilton in 1843 and applied to mechanics in three-dimensional space. Hamilton defined a quaternion as the quotient of two directed lines in a three-dimensional space, or, equivalently, as the quotient of two vectors. Multiplication of quaternions is noncommutative.

In geometry, a solid angle is a measure of the amount of the field of view from some particular point that a given object covers. That is, it is a measure of how large the object appears to an observer looking from that point. The point from which the object is viewed is called the apex of the solid angle, and the object is said to subtend its solid angle at that point.

Additive white Gaussian noise (AWGN) is a basic noise model used in information theory to mimic the effect of many random processes that occur in nature. The modifiers denote specific characteristics:

In geometry, a barycentric coordinate system is a coordinate system in which the location of a point is specified by reference to a simplex. The barycentric coordinates of a point can be interpreted as masses placed at the vertices of the simplex, such that the point is the center of mass of these masses. These masses can be zero or negative; they are all positive if and only if the point is inside the simplex.

In computer science, learning vector quantization (LVQ) is a prototype-based supervised classification algorithm. LVQ is the supervised counterpart of vector quantization systems.

In mathematics, a norm is a function from a real or complex vector space to the non-negative real numbers that behaves in certain ways like the distance from the origin: it commutes with scaling, obeys a form of the triangle inequality, and is zero only at the origin. In particular, the Euclidean distance in a Euclidean space is defined by a norm on the associated Euclidean vector space, called the Euclidean norm, the 2-norm, or, sometimes, the magnitude of the vector. This norm can be defined as the square root of the inner product of a vector with itself.

In mathematics, a functional is a certain type of function. The exact definition of the term varies depending on the subfield.

Code-excited linear prediction (CELP) is a linear predictive speech coding algorithm originally proposed by Manfred R. Schroeder and Bishnu S. Atal in 1985. At the time, it provided significantly better quality than existing low bit-rate algorithms, such as residual-excited linear prediction (RELP) and linear predictive coding (LPC) vocoders. Along with its variants, such as algebraic CELP, relaxed CELP, low-delay CELP and vector sum excited linear prediction, it is currently the most widely used speech coding algorithm. It is also used in MPEG-4 Audio speech coding. CELP is commonly used as a generic term for a class of algorithms and not for a particular codec.

In geometry, a three-dimensional space is a mathematical space in which three values (coordinates) are required to determine the position of a point. Most commonly, it is the three-dimensional Euclidean space, the Euclidean n-space of dimension n=3 that models physical space. More general three-dimensional spaces are called 3-manifolds.

In theoretical physics, scalar field theory can refer to a relativistically invariant classical or quantum theory of scalar fields. A scalar field is invariant under any Lorentz transformation.

In mathematics, Dvoretzky's theorem is an important structural theorem about normed vector spaces proved by Aryeh Dvoretzky in the early 1960s, answering a question of Alexander Grothendieck. In essence, it says that every sufficiently high-dimensional normed vector space will have low-dimensional subspaces that are approximately Euclidean. Equivalently, every high-dimensional bounded symmetric convex set has low-dimensional sections that are approximately ellipsoids.

In computer vision, the bag-of-words model sometimes called bag-of-visual-words model can be applied to image classification or retrieval, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

In geometry, a ball is a region in a space comprising all points within a fixed distance, called the radius, from a given point; that is, it is the region enclosed by a sphere or hypersphere. An $n$ -ball is a ball in an $n$ -dimensional Euclidean space. The volume of a $n$ -ball is the Lebesgue measure of this ball, which generalizes to any dimension the usual volume of a ball in 3-dimensional space. The volume of a $n$ -ball of radius $R$ is $where is the volume of the unit n -ball, the n -ball of radius 1 .$

Constrained Energy Lapped Transform (CELT) is an open, royalty-free lossy audio compression format and a free software codec with especially low algorithmic delay for use in low-latency audio communication. The algorithms are openly documented and may be used free of software patent restrictions. Development of the format was maintained by the Xiph.Org Foundation and later coordinated by the Opus working group of the Internet Engineering Task Force (IETF).

In mathematics and physics, vector is a term that refers colloquially to some quantities that cannot be expressed by a single number, or to elements of some vector spaces.

In mathematics, a unit sphere is simply a sphere of radius one around a given center. More generally, it is the set of points of distance 1 from a fixed central point, where different norms can be used as general notions of "distance". A unit ball is the closed set of points of distance less than or equal to 1 from a fixed central point. Usually the center is at the origin of the space, so one speaks of "the unit ball" or "the unit sphere". Special cases are the unit circle and the unit disk.

This is a glossary of linear algebra.

References

↑ Fischer, Thomas R. (July 1986). "A Pyramid Vector Quantizer". IEEE Transactions on Information Theory . 32 (4): 568–583. doi:10.1109/TIT.1986.1057198.
1 2 Duda, Jarek (2017). "Improving Pyramid Vector Quantizer with power projection". arXiv: 1705.05285 [math.OC].
↑ Valin, Jean-Marc (September 2013). "Pyramid Vector Quantization for Video Coding" (PDF). Xiph.Org Foundation . Retrieved April 4, 2021.
1 2 3 Valin, Jean-Marc; Terriberry, Timothy B.; Montgomery, Christopher; Maxwell, Gregory (January 2010). "A High-Quality Speech and Audio Codec With Less Than 10 ms Delay". IEEE Transactions on Audio, Speech, and Language Processing. 18 (1): 58–67. arXiv: 1602.05526 . doi:10.1109/TASL.2009.2023186. S2CID 11516136.
↑ Terriberry, Timothy B. (2009). "cwrs.c". Opus. Xiph.Org Foundation . Retrieved April 6, 2021.
↑ Terriberry, Timothy B. (December 2007). "Pulse Vector Coding". Xiph.Org Foundation. Archived from the original on September 30, 2019. Retrieved April 4, 2021.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Fischer, Thomas R. (July 1986). "A Pyramid Vector Quantizer". IEEE Transactions on Information Theory . 32 (4): 568–583. doi:10.1109/TIT.1986.1057198.

[imp_pvq-2] 1 2 Duda, Jarek (2017). "Improving Pyramid Vector Quantizer with power projection". arXiv: 1705.05285 [math.OC].

[3] Valin, Jean-Marc (September 2013). "Pyramid Vector Quantization for Video Coding" (PDF). Xiph.Org Foundation . Retrieved April 4, 2021.

[celt_tasl-4] 1 2 3 Valin, Jean-Marc; Terriberry, Timothy B.; Montgomery, Christopher; Maxwell, Gregory (January 2010). "A High-Quality Speech and Audio Codec With Less Than 10 ms Delay". IEEE Transactions on Audio, Speech, and Language Processing. 18 (1): 58–67. arXiv: 1602.05526 . doi:10.1109/TASL.2009.2023186. S2CID 11516136.

[5] Terriberry, Timothy B. (2009). "cwrs.c". Opus. Xiph.Org Foundation . Retrieved April 6, 2021.

[cwrs-6] Terriberry, Timothy B. (December 2007). "Pulse Vector Coding". Xiph.Org Foundation. Archived from the original on September 30, 2019. Retrieved April 4, 2021.

[1]

[3]

[4]

[2]

[5]

[6]