Line spectral pairs

Line spectral pairs (LSP) or line spectral frequencies (LSF) are used to represent linear prediction coefficients (LPC) for transmission over a channel. [1] LSPs have several properties (e.g. lower sensitivity to quantization noise) that make them superior to direct quantization of LPCs. For this reason, LSPs are very useful in speech coding.

The LSP representation was developed by Fumitada Itakura [2] at Nippon Telegraph and Telephone (NTT) in 1975. [3] From 1975 to 1981, he studied problems in speech analysis and synthesis based on the LSP method. [4] In 1980, his team developed an LSP-based speech synthesizer chip. LSP is an important technology for speech synthesis and coding, and in the 1990s it was adopted by almost all international speech coding standards as an essential component, contributing to the enhancement of digital speech communication over mobile channels and the internet worldwide. [3] LSPs are used in the code-excited linear prediction (CELP) algorithm, developed by Bishnu S. Atal and Manfred R. Schroeder in 1985.

Mathematical foundation

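The LP polynomial $A(z)$ of order $p$ can be expressed as $A(z) = \tfrac{1}{2}[P(z) + Q(z)]$, where:

$$P(z) = A(z) + z^{-(p+1)} A(z^{-1})$$

$$Q(z) = A(z) - z^{-(p+1)} A(z^{-1})$$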

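By construction, P is a palindromic polynomial and Q an antipalindromic polynomial; physically P(z) corresponds to the vocal tract with the glottis closed and Q(z) with the glottis open. [5] It can be shown that:

- The roots of P and Q lie on the unit circle in the complex plane.
- The roots of P alternate with those of Q as we travel around the circle.
- As the coefficients of P and Q are real, the roots occur in conjugate pairs.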
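The construction of P and Q can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the usual definitions $P(z) = A(z) + z^{-(p+1)}A(z^{-1})$ and $Q(z) = A(z) - z^{-(p+1)}A(z^{-1})$, stores $A(z)$ as a list whose entry `a[k]` is the coefficient of $z^{-k}$ with `a[0] == 1`, and uses a made-up second-order filter. The function name `lsp_polynomials` is hypothetical.

```python
def lsp_polynomials(a):
    """Return (P, Q) coefficient lists for the LP polynomial A(z).

    Assumed convention: a[k] is the coefficient of z^-k and a[0] == 1.
    """
    padded = list(a) + [0.0]               # A(z), padded to degree p + 1
    mirrored = [0.0] + list(reversed(a))   # z^-(p+1) * A(1/z)
    P = [x + y for x, y in zip(padded, mirrored)]
    Q = [x - y for x, y in zip(padded, mirrored)]
    return P, Q


a = [1.0, -0.9, 0.81]                      # made-up 2nd-order example
P, Q = lsp_polynomials(a)

# P reads the same in both directions; Q changes sign when reversed.
assert all(abs(x - y) < 1e-12 for x, y in zip(P, reversed(P)))
assert all(abs(x + y) < 1e-12 for x, y in zip(Q, reversed(Q)))

# Averaging P and Q recovers A(z); the extra z^-(p+1) term cancels.
recovered = [(x + y) / 2 for x, y in zip(P, Q)]
assert all(abs(r - c) < 1e-12 for r, c in zip(recovered, a + [0.0]))
```

For this example P comes out as approximately `[1, -0.09, -0.09, 1]` and Q as approximately `[1, -1.71, 1.71, -1]`, which makes the palindromic and antipalindromic symmetry visible directly.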

The Line Spectral Pair representation of the LP polynomial consists simply of the location of the roots of P and Q (i.e. the angles $\omega$ such that $z = e^{i\omega}$). As they occur in pairs, only half of the actual roots (conventionally those between 0 and $\pi$) need be transmitted. The total number of coefficients for both P and Q is therefore equal to $p$, the number of original LP coefficients (not counting $a_0 = 1$).

A common algorithm for finding these roots [6] is to evaluate the polynomial at a sequence of closely spaced points around the unit circle, observing when the result changes sign; when it does, a root must lie between the points tested. Because the roots of P are interspersed with those of Q, a single pass is sufficient to find the roots of both polynomials.
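The sign-change search can be sketched as follows. This is a minimal illustration, not the routine from the cited reference. It assumes the definitions $P(z) = A(z) + z^{-(p+1)}A(z^{-1})$ and $Q(z) = A(z) - z^{-(p+1)}A(z^{-1})$, and it makes one design choice of its own: on the unit circle the common linear-phase factor is removed, so that P reduces to a real cosine series and Q to a real sine series whose sign can be tested. The name `lpc_to_lsf`, the grid size and the bisection refinement are all assumptions made for the example.

```python
import math


def lpc_to_lsf(a, grid=512, iters=50):
    """Return the line spectral frequencies of A(z) in radians, ascending.

    Assumed convention: a[k] is the coefficient of z^-k and a[0] == 1.
    """
    padded = list(a) + [0.0]
    mirrored = [0.0] + list(reversed(a))
    P = [x + y for x, y in zip(padded, mirrored)]
    Q = [x - y for x, y in zip(padded, mirrored)]
    mid = len(a) / 2.0                     # centre of symmetry, (p + 1) / 2

    # With the linear-phase factor removed, P and Q are real on the circle.
    def f_p(w):
        return sum(c * math.cos((k - mid) * w) for k, c in enumerate(P))

    def f_q(w):
        return sum(c * math.sin((k - mid) * w) for k, c in enumerate(Q))

    def bisect(f, lo, hi):
        f_lo = f(lo)
        for _ in range(iters):
            m = 0.5 * (lo + hi)
            f_m = f(m)
            if f_lo * f_m <= 0.0:
                hi = m                     # root is in the lower half
            else:
                lo, f_lo = m, f_m          # root is in the upper half
        return 0.5 * (lo + hi)

    roots = []
    # Stay strictly inside (0, pi): the trivial roots at z = +1 and z = -1
    # carry no information and are skipped.
    for j in range(1, grid - 1):
        lo, hi = math.pi * j / grid, math.pi * (j + 1) / grid
        for f in (f_p, f_q):               # both polynomials in one pass
            if f(lo) * f(hi) < 0.0:
                roots.append(bisect(f, lo, hi))
    return sorted(roots)


print(lpc_to_lsf([1.0, -0.9, 0.81]))       # made-up 2nd-order example
```

For the made-up filter above, the search returns two frequencies, roughly 0.994 and 1.208 rad, one from P and one from Q. A grid this simple misses roots that fall exactly on a grid point or two roots of the same polynomial inside one grid cell, so a practical implementation would need a finer grid or extra checks.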

To convert back to LPCs, we need to evaluate $A(z) = \tfrac{1}{2}[P(z) + Q(z)]$ by "clocking" an impulse through it N times (N being the order of the filter), yielding the original filter, A(z).
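The reverse conversion can be sketched by rebuilding P and Q from their roots and averaging them. Instead of clocking an impulse through the filter, this sketch multiplies out the second-order factors directly, which yields the same coefficients. It assumes the frequencies are given in ascending order in radians, that the first, third, ... frequencies belong to P and the others to Q, and that `a[k]` is the coefficient of $z^{-k}$. The names `poly_mul` and `lsf_to_lpc` are hypothetical.

```python
import math


def poly_mul(u, v):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(u) + len(v) - 1)
    for i, x in enumerate(u):
        for j, y in enumerate(v):
            out[i + j] += x * y
    return out


def lsf_to_lpc(lsf):
    """Rebuild the coefficients of A(z) from ascending LSFs in radians.

    Assumption: even-indexed frequencies are roots of P, odd-indexed of Q.
    """
    p = len(lsf)
    P, Q = [1.0], [1.0]
    for i, w in enumerate(lsf):
        # One conjugate pair of roots on the unit circle at angles +w, -w.
        section = [1.0, -2.0 * math.cos(w), 1.0]
        if i % 2 == 0:
            P = poly_mul(P, section)
        else:
            Q = poly_mul(Q, section)
    # Restore the trivial roots at z = -1 and z = +1.
    if p % 2 == 0:
        P = poly_mul(P, [1.0, 1.0])
        Q = poly_mul(Q, [1.0, -1.0])
    else:
        Q = poly_mul(Q, [1.0, 0.0, -1.0])
    a = [(x + y) / 2 for x, y in zip(P, Q)]
    return a[:p + 1]                       # the z^-(p+1) term cancels


# Made-up example: two frequencies whose cosines are 0.545 and 0.355.
print(lsf_to_lpc([math.acos(0.545), math.acos(0.355)]))
```

With those two frequencies the result is approximately `[1, -0.9, 0.81]`, up to floating-point rounding.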

Properties

Line spectral pairs have several useful properties. The filter is stable if and only if the roots of P(z) and Q(z) are interleaved, that is, if the line spectral frequencies taken alternately from P and Q are monotonically increasing. Moreover, the closer two roots are, the more resonant the filter is at the corresponding frequency. Because LSPs are not overly sensitive to quantization noise and stability is easily checked, LSPs are widely used for quantizing LPC filters. Line spectral frequencies can also be interpolated.
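The ordering check and the interpolation can be illustrated with a short sketch. The frequency values below are made up, and the function names `is_valid_lsf` and `interpolate_lsf` are hypothetical. The sketch assumes frequencies in radians and uses plain linear interpolation. A weighted average of two strictly increasing vectors is itself strictly increasing, so the interpolated vector passes the same ordering check.

```python
import math


def is_valid_lsf(lsf):
    """True if the frequencies are strictly increasing inside (0, pi)."""
    bounds = [0.0] + list(lsf) + [math.pi]
    return all(lo < hi for lo, hi in zip(bounds, bounds[1:]))


def interpolate_lsf(lsf_a, lsf_b, t):
    """Blend two LSF vectors; t = 0 gives lsf_a and t = 1 gives lsf_b."""
    return [(1.0 - t) * x + t * y for x, y in zip(lsf_a, lsf_b)]


prev_frame = [0.30, 0.55, 1.20, 1.90]      # made-up values, in radians
next_frame = [0.40, 0.70, 1.00, 2.10]

halfway = interpolate_lsf(prev_frame, next_frame, 0.5)
assert is_valid_lsf(halfway)

# A vector whose entries are out of order fails the check.
assert not is_valid_lsf([0.30, 1.20, 0.55, 1.90])
```

A quantizer could run the same check on decoded values and reorder or reject a vector that fails it; how that is handled is left open here.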

References

  1. Sahidullah, Md.; Chakroborty, Sandipan; Saha, Goutam (Jan 2010). "On the use of perceptual Line Spectral pairs Frequencies and higher-order residual moments for Speaker Identification". International Journal of Biometrics. 2 (4): 358–378. doi:10.1504/ijbm.2010.035450.
  2. Zheng, F.; Song, Z.; Li, L.; Yu, W. (1998). "The Distance Measure for Line Spectrum Pairs Applied to Speech Recognition" (PDF). Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98) (3): 1123–6.
  3. "List of IEEE Milestones". IEEE. Retrieved 15 July 2019.
  4. "Fumitada Itakura Oral History". IEEE Global History Network. 20 May 2009. Retrieved 2009-07-21.
  5. Robinson, Tony. "Speech Analysis". http://svr-www.eng.cam.ac.uk/~ajr/SpeechAnalysis/node51.html#SECTION000713000000000000000
  6. e.g. lsf.c in http://www.ietf.org/rfc/rfc3951.txt