Linear predictive coding

Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive model.[1][2]

LPC is the most widely used method in speech coding and speech synthesis. It is a powerful speech analysis technique, and a useful method for encoding good quality speech at a low bit rate.

Overview

LPC starts with the assumption that a speech signal is produced by a buzzer at the end of a tube (for voiced sounds), with occasional added hissing and popping sounds (for voiceless sounds such as sibilants and plosives). Although apparently crude, this source–filter model is actually a close approximation of the reality of speech production. The glottis (the space between the vocal folds) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances; these resonances give rise to formants, or enhanced frequency bands in the sound produced. Hisses and pops are generated by the action of the tongue, lips and throat during sibilants and plosives.

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered modeled signal is called the residue.
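
In signal-processing terms, the formant estimate is an all-pole filter whose coefficients minimize the prediction error over a frame. The following is a minimal sketch of this analysis step, assuming NumPy; it uses the classic autocorrelation method with the Levinson–Durbin recursion, and the function and variable names are illustrative rather than taken from any particular codec.

```python
import numpy as np

def lpc_analyze(frame, order):
    """Estimate LPC coefficients for one frame with the autocorrelation
    method and Levinson-Durbin recursion, then inverse-filter the frame
    to obtain the residue (prediction error)."""
    # Autocorrelation at lags 0..order (the frame would normally be
    # windowed, e.g. with a Hamming window, before this step)
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(a[:i] @ r[i:0:-1]) / err      # reflection coefficient
        a[:i + 1] += k * a[:i + 1][::-1]    # update prediction polynomial A(z)
        err *= 1.0 - k * k                  # remaining prediction error power
    # Inverse filtering: residue e(n) = s(n) + sum_i a_i * s(n - i)
    residue = np.convolve(frame, a)[:len(frame)]
    return a, residue
```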

The numbers which describe the intensity and frequency of the buzz, the formants, and the residue signal can be stored or transmitted elsewhere. LPC synthesizes the speech signal by reversing the process: use the buzz parameters and the residue to create a source signal, use the formants to create a filter (which represents the tube), and run the source through the filter, resulting in speech.
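
Synthesis is then the mirror image: an excitation signal is passed through the all-pole filter 1/A(z) built from the same coefficients. A minimal sketch, assuming SciPy; the impulse-train excitation and the stand-in coefficients are arbitrary illustrative choices, and in practice the coefficients would come from `lpc_analyze` above.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(a, excitation):
    """Run an excitation signal through the all-pole filter 1/A(z)."""
    return lfilter([1.0], a, excitation)

# Illustrative voiced excitation: an impulse train at a 100 Hz pitch,
# assuming an 8 kHz sampling rate (both values are arbitrary examples).
fs, f0 = 8000, 100
excitation = np.zeros(2048)
excitation[::fs // f0] = 1.0
a = np.array([1.0, -1.3, 0.7])          # stand-in stable coefficients; in
speech = lpc_synthesize(a, excitation)  # practice use 'a' from lpc_analyze
```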

Because speech signals vary with time, this process is done on short chunks of the speech signal, which are called frames; generally, 30 to 50 frames per second give intelligible speech with good compression.
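
For instance, at an 8 kHz sampling rate, a 40 frames-per-second analysis works out to 200-sample (25 ms) frames. A sketch of the segmentation, using non-overlapping frames for simplicity (practical coders usually overlap and window them); the sampling rate and signal here are illustrative stand-ins:

```python
import numpy as np

fs = 8000                              # sampling rate in Hz (example value)
frames_per_second = 40                 # within the 30-50 range quoted above
frame_len = fs // frames_per_second    # 200 samples, i.e. 25 ms per frame
signal = np.random.randn(fs)           # stand-in for one second of speech
frames = [signal[i:i + frame_len]
          for i in range(0, len(signal) - frame_len + 1, frame_len)]
```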

Early history

Linear prediction (signal estimation) goes back to at least the 1940s, when Norbert Wiener developed a mathematical theory for calculating the best filters and predictors for detecting signals hidden in noise.[3][4] Soon after Claude Shannon established a general theory of coding, work on predictive coding was done by C. Chapin Cutler,[5] Bernard M. Oliver[6] and Henry C. Harrison.[7] In 1955, Peter Elias published two papers on predictive coding of signals.[8][9]

Linear predictors were applied to speech analysis independently by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone in 1966, and in 1967 by Bishnu S. Atal, Manfred R. Schroeder and John Burg. Itakura and Saito described a statistical approach based on maximum likelihood estimation; Atal and Schroeder described an adaptive linear predictor approach; Burg outlined an approach based on the principle of maximum entropy.[4][10][11][12]

In 1969, Itakura and Saito introduced a method based on partial correlation (PARCOR), Glen Culler proposed real-time speech encoding, and Bishnu S. Atal presented an LPC speech coder at the Annual Meeting of the Acoustical Society of America. In 1971, real-time LPC using 16-bit LPC hardware was demonstrated by Philco-Ford; four units were sold.[13] LPC technology was advanced by Bishnu Atal and Manfred Schroeder during the 1970s and 1980s.[13] In 1978, Atal and Vishwanath et al. of BBN developed the first variable-rate LPC algorithm.[13] The same year, Atal and Manfred R. Schroeder at Bell Labs proposed an LPC speech codec called adaptive predictive coding, which used a psychoacoustic coding algorithm exploiting the masking properties of the human ear.[14][15] This later became the basis for the perceptual coding technique used by the MP3 audio compression format, introduced in 1993.[14] Code-excited linear prediction (CELP) was developed by Schroeder and Atal in 1985.[16]

LPC is the basis for voice-over-IP (VoIP) technology.[13] In 1972, Bob Kahn of ARPA, with Jim Forgie of Lincoln Laboratory (LL) and Dave Walden of BBN Technologies, started the first developments in packetized speech, which would eventually lead to voice-over-IP technology. In 1973, according to an informal Lincoln Laboratory history, the first real-time 2400 bit/s LPC was implemented by Ed Hofstetter. In 1974, the first real-time two-way LPC packet speech communication was accomplished over the ARPANET at 3500 bit/s between Culler-Harrison and Lincoln Laboratory. In 1976, the first LPC conference call took place over the ARPANET using the Network Voice Protocol, between Culler-Harrison, ISI, SRI, and LL at 3500 bit/s.

LPC coefficient representations

LPC is frequently used for transmitting spectral envelope information, and as such it has to be tolerant of transmission errors. Transmitting the filter coefficients directly (see linear prediction for a definition of coefficients) is undesirable, since they are very sensitive to errors: a very small error can distort the whole spectrum or, worse, make the prediction filter unstable.
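
This sensitivity is easy to demonstrate: the synthesis filter 1/A(z) is stable only if every root of A(z) lies strictly inside the unit circle, and coarsely quantizing the direct-form coefficients can push a root outside it. A small sketch, assuming NumPy; the root placement and quantization step are arbitrary illustrative values:

```python
import numpy as np

def is_stable(a):
    """True if all zeros of A(z) lie strictly inside the unit circle,
    i.e. the synthesis filter 1/A(z) is stable."""
    return bool(np.all(np.abs(np.roots(a)) < 1.0))

# A stable example predictor: zeros placed just inside the unit circle
zeros = 0.98 * np.exp(1j * np.array([0.3, -0.3, 1.2, -1.2]))
a = np.real(np.poly(zeros))             # direct-form coefficients, a[0] = 1
a_quant = np.round(a / 0.25) * 0.25     # coarse quantization (step 0.25)
print(is_stable(a), is_stable(a_quant)) # True False: stability is lost
```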

There are representations that are more robust to such errors, such as log area ratios (LAR), line spectral pairs (LSP) decomposition and reflection coefficients. Of these, LSP decomposition in particular has gained popularity, since it ensures the stability of the predictor and keeps spectral errors local for small coefficient deviations.
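
As a concrete illustration, reflection coefficients are confined to (-1, 1) for a stable predictor, and the log area ratio is an invertible mapping that spreads that interval over the whole real line, so quantization in the LAR domain cannot produce an unstable filter. A sketch using the common definition LAR_i = ln((1 + k_i)/(1 - k_i)); sign conventions vary in the literature:

```python
import numpy as np

def reflection_to_lar(k):
    """Log area ratios from reflection coefficients (|k| < 1)."""
    return np.log((1.0 + k) / (1.0 - k))

def lar_to_reflection(g):
    """Inverse mapping: any real LAR value maps back into (-1, 1)."""
    return np.tanh(g / 2.0)

k = np.array([0.9, -0.5, 0.2])          # example reflection coefficients
assert np.allclose(lar_to_reflection(reflection_to_lar(k)), k)
```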

Applications

LPC is the most widely used method in speech coding and speech synthesis.[17] It is generally used for speech analysis and resynthesis. It is used as a form of voice compression by phone companies, for example in the GSM standard. It is also used for secure wireless communication, where voice must be digitized, encrypted and sent over a narrow voice channel; an early example of this is the US government's Navajo I.

LPC synthesis can be used to construct vocoders where musical instruments are used as an excitation signal to the time-varying filter estimated from a singer's speech. This is somewhat popular in electronic music. Paul Lansky made the well-known computer music piece notjustmoreidlechatter using linear predictive coding.[18] A 10th-order LPC was used in the popular 1980s Speak & Spell educational toy.

LPC predictors are used in lossless audio codecs such as Shorten, MPEG-4 ALS and FLAC, as well as in the SILK speech codec.

LPC has received some attention as a tool for use in the tonal analysis of violins and other stringed musical instruments.[19]

Related Research Articles

Audio signal processing is a subfield of signal processing that is concerned with the electronic manipulation of audio signals. Audio signals are electronic representations of sound waves: longitudinal waves which travel through air, consisting of compressions and rarefactions. The energy contained in audio signals, or sound power level, is typically measured in decibels. As audio signals may be represented in either digital or analog format, processing may occur in either domain. Analog processors operate directly on the electrical signal, while digital processors operate mathematically on its digital representation.

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.

Linear prediction is a mathematical operation where future values of a discrete-time signal are estimated as a linear function of previous samples.
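
Written out (a standard textbook formulation; the symbol names are chosen here for illustration), an order-p linear predictor forms

```latex
\hat{s}(n) = \sum_{i=1}^{p} a_i \, s(n-i), \qquad e(n) = s(n) - \hat{s}(n),
```

where s(n) is the signal, a_i are the predictor coefficients and e(n) is the prediction error whose energy the coefficients are chosen to minimize.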

Speech processing is the study of speech signals and their processing methods. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Different speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement and speaker recognition.

Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.

A vocoder is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation.

Digital audio is a representation of sound recorded in, or converted into, digital form. In digital audio, the sound wave of the audio signal is typically encoded as numerical samples in a continuous sequence. For example, in CD audio, samples are taken 44,100 times per second, each with 16-bit resolution. Digital audio is also the name for the entire technology of sound recording and reproduction using audio signals that have been encoded in digital form. Following significant advances in digital audio technology during the 1970s and 1980s, it gradually replaced analog audio technology in many areas of audio engineering, record production and telecommunications in the 1990s and 2000s.

Algebraic code-excited linear prediction (ACELP) is a speech coding algorithm in which a limited set of pulses is distributed as excitation to a linear prediction filter. It is a linear predictive coding (LPC) algorithm that is based on the code-excited linear prediction (CELP) method and has an algebraic structure. ACELP was developed in 1989 by researchers at the Université de Sherbrooke in Canada.

Mixed-excitation linear prediction (MELP) is a United States Department of Defense speech coding standard used mainly in military applications, satellite communications, secure voice and secure radio devices. Its standardization and later development were led and supported by the NSA and NATO. The current "enhanced" version is known as MELPe.

Harmonic Vector Excitation Coding, abbreviated HVXC, is a speech coding algorithm specified in the MPEG-4 Part 3 standard for very low bit rate speech coding. HVXC supports bit rates of 2 and 4 kbit/s in fixed and variable bit rate modes at a sampling frequency of 8 kHz. It also operates at lower bit rates, such as 1.2–1.7 kbit/s, using a variable bit rate technique. The total algorithmic delay for the encoder and decoder is 36 ms.

Code-excited linear prediction (CELP) is a linear predictive speech coding algorithm originally proposed by Manfred R. Schroeder and Bishnu S. Atal in 1985. At the time, it provided significantly better quality than existing low bit-rate algorithms, such as residual-excited linear prediction (RELP) and linear predictive coding (LPC) vocoders. Along with its variants, such as algebraic CELP, relaxed CELP, low-delay CELP and vector sum excited linear prediction, it is currently the most widely used speech coding algorithm. It is also used in MPEG-4 Audio speech coding. CELP is commonly used as a generic term for a class of algorithms and not for a particular codec.

Secure voice is a term in cryptography for the encryption of voice communication over a range of communication types such as radio, telephone or IP.

Vector sum excited linear prediction (VSELP) is a speech coding method used in several cellular standards. The VSELP algorithm is an analysis-by-synthesis coding technique and belongs to the class of speech coding algorithms known as CELP.

Line spectral pairs (LSP) or line spectral frequencies (LSF) are used to represent linear prediction coefficients (LPC) for transmission over a channel. LSPs have several properties that make them superior to direct quantization of LPCs. For this reason, LSPs are very useful in speech coding.
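
A minimal sketch of one common construction, assuming NumPy: A(z) is split into the palindromic polynomial P(z) = A(z) + z^-(p+1) A(z^-1) and the antipalindromic polynomial Q(z) = A(z) - z^-(p+1) A(z^-1); for a stable predictor their roots lie on the unit circle and interleave, and the line spectral frequencies are the angles of those roots.

```python
import numpy as np

def lpc_to_lsf(a):
    """Line spectral frequencies (radians in (0, pi)) from LPC
    coefficients a = [1, a1, ..., ap]."""
    p_poly = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q_poly = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = np.angle(np.concatenate([np.roots(p_poly), np.roots(q_poly)]))
    # Keep one angle per conjugate pair; drop the trivial roots at z = 1, -1
    return np.sort(angles[(angles > 1e-9) & (angles < np.pi - 1e-9)])
```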

Warped linear predictive coding is a variant of linear predictive coding in which the spectral representation of the system is modified, for example by replacing the unit delays used in an LPC implementation with first-order all-pass filters. This can have advantages in reducing the bitrate required for a given level of perceived audio quality/intelligibility, especially in wideband audio coding.

Bishnu S. Atal is an Indian physicist and engineer. He is a noted researcher in acoustics, and is best known for developments in speech coding. He advanced linear predictive coding (LPC) during the late 1960s to 1970s, and developed code-excited linear prediction (CELP) with Manfred R. Schroeder in 1985.

Fumitada Itakura is a Japanese scientist. He did pioneering work in statistical signal processing, and its application to speech analysis, synthesis and coding, including the development of the linear predictive coding (LPC) and line spectral pairs (LSP) methods.

Manfred Robert Schroeder was a German physicist, most known for his contributions to acoustics and computer graphics. He wrote three books and published over 150 articles in his field.

An audio coding format is a content representation format for storage or transmission of digital audio. Examples of audio coding formats include MP3, AAC, Vorbis, FLAC, and Opus. A specific software or hardware implementation capable of audio compression and decompression to/from a specific audio coding format is called an audio codec; an example is LAME, one of several codecs that implement encoding and decoding of audio in the MP3 format in software.

John Makhoul is a Lebanese-American computer scientist who works in the field of speech and language processing. Dr. Makhoul's work on linear predictive coding was used in the establishment of the Network Voice Protocol, which enabled the transmission of speech signals over the ARPANET. Makhoul is recognized in the field for his vital role in the areas of speech and language processing, including speech analysis, speech coding, speech recognition and speech understanding. He has made a number of significant contributions to the mathematical modeling of speech signals, including his work on linear prediction, and vector quantization. His patented work on the direct application of speech recognition techniques for accurate, language-independent optical character recognition (OCR) has had a dramatic impact on the ability to create OCR systems in multiple languages relatively quickly.

References

  1. Deng, Li; Douglas O'Shaughnessy (2003). Speech processing: a dynamic and optimization-oriented approach. Marcel Dekker. pp. 41–48. ISBN 978-0-8247-4040-5.
  2. Beigi, Homayoon (2011). Fundamentals of Speaker Recognition. Berlin: Springer-Verlag. ISBN 978-0-387-77591-3.
  3. B.S. Atal (2006). "The history of linear prediction". IEEE Signal Processing Magazine. 23 (2): 154–161. Bibcode:2006ISPM...23..154A. doi:10.1109/MSP.2006.1598091. S2CID 15601493.
  4. Y. Sasahira; S. Hashimoto (1995). "Voice pitch changing by Linear Predictive Coding Method to keep the Singer's Personal Timbre" (PDF). Michigan Publishing.
  5. US patent 2605361, C. C. Cutler, "Differential quantization of communication signals", published 1952-07-29.
  6. B. M. Oliver (1952). "Efficient coding". The Bell System Technical Journal. 31 (4). Nokia Bell Labs: 724–750. doi:10.1002/j.1538-7305.1952.tb01403.x.
  7. H. C. Harrison (1952). "Experiments with linear prediction in television". Bell System Technical Journal. 31 (4): 764–783. doi:10.1002/j.1538-7305.1952.tb01405.x.
  8. P. Elias (1955). "Predictive coding I". IRE Trans. Inform. Theory. IT-1 (1): 16–24. doi:10.1109/TIT.1955.1055126.
  9. P. Elias (1955). "Predictive coding II". IRE Trans. Inform. Theory. IT-1 (1): 24–33. doi:10.1109/TIT.1955.1055116.
  10. S. Saito; F. Itakura (Jan 1967). "Theoretical consideration of the statistical optimum recognition of the spectral density of speech". J. Acoust. Soc. Japan.
  11. B.S. Atal; M.R. Schroeder (1967). "Predictive coding of speech". Conf. Communications and Proc.
  12. J.P. Burg (1967). "Maximum Entropy Spectral Analysis". Proceedings of 37th Meeting, Society of Exploration Geophysics, Oklahoma City.
  13. Gray, Robert M. (2010). "A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol" (PDF). Found. Trends Signal Process. 3 (4): 203–303. doi:10.1561/2000000036. ISSN 1932-8346. Archived (PDF) from the original on 2022-10-09.
  14. Schroeder, Manfred R. (2014). "Bell Laboratories". Acoustics, Information, and Communication: Memorial Volume in Honor of Manfred R. Schroeder. Springer. p. 388. ISBN 9783319056609.
  15. Atal, B.; Schroeder, M. (1978). "Predictive coding of speech signals and subjective error criteria". ICASSP '78. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 3. pp. 573–576. doi:10.1109/ICASSP.1978.1170564.
  16. Schroeder, Manfred R.; Atal, Bishnu S. (1985). "Code-excited linear prediction (CELP): High-quality speech at very low bit rates". ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 10. pp. 937–940. doi:10.1109/ICASSP.1985.1168147. S2CID 14803427.
  17. Gupta, Shipra (May 2016). "Application of MFCC in Text Independent Speaker Recognition" (PDF). International Journal of Advanced Research in Computer Science and Software Engineering. 6 (5): 805–810 (806). ISSN 2277-128X. S2CID 212485331. Archived from the original (PDF) on 2019-10-18. Retrieved 18 October 2019.
  18. Lansky, Paul. "More Than Idle Chatter". Archived from the original on 2017-12-24. Retrieved 2024-06-02.
  19. Tai, Hwan-Ching; Chung, Dai-Ting (June 14, 2012). "Stradivari Violins Exhibit Formant Frequencies Resembling Vowels Produced by Females". Savart Journal. 1 (2).
