Harmonic and Individual Lines and Noise


Harmonic and Individual Lines and Noise (HILN) is a parametric audio codec. Its basic premise is that most audio, and particularly speech, can be synthesized from sinusoids and noise alone. The encoder describes individual sinusoids by amplitude and frequency, harmonic tones by fundamental frequency, amplitude and the spectral envelope of the partials, and noise by amplitude and spectral envelope. This type of encoder can code audio at between 6 and 16 kbit/s for a typical audio bandwidth of 8 kHz. The frame length of this encoder is 32 ms.
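As a rough illustration of this parameter set, the following sketch synthesizes one 32 ms frame from hypothetical HILN-style parameters. The parameter names and values are illustrative only, not taken from the standard:

```python
import numpy as np

# Hypothetical parameters for one 32 ms frame at 16 kHz sampling.
fs = 16000
n = int(0.032 * fs)               # 512 samples per frame
t = np.arange(n) / fs

f0 = 200.0                        # fundamental frequency of the harmonic tone
partial_amps = [1.0, 0.5, 0.25]   # spectral envelope sampled at the partials
indiv = [(1234.0, 0.3)]           # individual lines: (frequency, amplitude)
noise_amp = 0.05                  # amplitude of the noise component

frame = np.zeros(n)
for k, a in enumerate(partial_amps, start=1):
    frame += a * np.sin(2 * np.pi * k * f0 * t)   # harmonic partials
for f, a in indiv:
    frame += a * np.sin(2 * np.pi * f * t)        # individual sinusoids
rng = np.random.default_rng(0)
frame += noise_amp * rng.standard_normal(n)       # noise component
```

A real HILN decoder also shapes the noise with a spectral envelope and interpolates parameters between frames; this sketch only shows how few numbers are needed to describe 32 ms of sound.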

A codec is a device or computer program for encoding or decoding a digital data stream or signal. Codec is a portmanteau of coder-decoder.

Sound

In physics, sound is a vibration that typically propagates as an audible wave of pressure through a transmission medium such as a gas, liquid or solid.

Sine wave

A sine wave or sinusoid is a mathematical curve that describes a smooth periodic oscillation. A sine wave is a continuous wave. It is named after the function sine, of which it is the graph. It occurs often in pure and applied mathematics, as well as physics, engineering, signal processing and many other fields. Its most basic form as a function of time t is y(t) = A sin(2πft + φ), where A is the amplitude, f the frequency, and φ the phase.

A typical encoder extracts sinusoid information from the samples by applying a short-time Fourier transform and using it to find the important harmonic content of a single frame. By matching sinusoids across frames, the encoder can group them into harmonic lines and individual sinusoids. The matching can take amplitude, frequency and phase into account. Differences in amplitude and frequency within a track can be coded with fewer bits than each individual sinusoid would require, so the longer a track the encoder can find, the more it can reduce the final bitrate.
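A minimal sketch of this peak-picking and frame-to-frame matching, assuming a simple greedy matcher on frequency alone; the threshold and helper names are made up for illustration:

```python
import numpy as np

fs = 16000
frame_len = 512                      # 32 ms at 16 kHz

def find_peaks(frame, n_peaks=5):
    """Return (frequency, amplitude) of the strongest spectral peaks."""
    windowed = frame * np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(windowed))
    # local maxima: bins larger than both neighbours
    idx = np.flatnonzero((spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:])) + 1
    idx = idx[np.argsort(spec[idx])][::-1][:n_peaks]   # keep the strongest
    freqs = idx * fs / len(frame)
    return sorted(zip(freqs, spec[idx]))

def match(prev_peaks, cur_peaks, max_hz=50.0):
    """Greedily link peaks whose frequencies differ by less than max_hz.

    A fuller matcher would also weigh amplitude and phase continuity."""
    tracks = []
    for fp, ap in prev_peaks:
        for fc, ac in cur_peaks:
            if abs(fp - fc) < max_hz:
                tracks.append((fp, fc))
                break
    return tracks

# two consecutive frames of a tone that drifts slightly in frequency
t = np.arange(frame_len) / fs
a = np.sin(2 * np.pi * 440 * t)
b = np.sin(2 * np.pi * 445 * (t + frame_len / fs))
tracks = match(find_peaks(a, 1), find_peaks(b, 1))
```

Because the drifting tone is linked into one track, only the small frequency difference needs to be coded for the second frame.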

Short-time Fourier transform

The short-time Fourier transform (STFT) is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time. In practice, the procedure for computing STFTs is to divide a longer time signal into shorter segments of equal length and then compute the Fourier transform separately on each shorter segment. This reveals the Fourier spectrum on each shorter segment. One then usually plots the changing spectra as a function of time.
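The procedure described above can be sketched directly, assuming Hann-windowed segments and a hop of half the segment length (the segment size here is generic, not HILN's frame size):

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Split x into overlapping segments, window each, and FFT it."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.fft.rfft(win * x[i * hop : i * hop + frame_len])
                     for i in range(n_frames)])

fs = 8000
t = np.arange(fs) / fs                  # one second of signal
x = np.sin(2 * np.pi * 1000 * t)        # a steady 1 kHz tone
S = stft(x)                             # one spectrum per segment

peak_bin = int(np.abs(S[0]).argmax())
peak_hz = peak_bin * fs / 512           # bin spacing = fs / frame_len
```

Plotting `np.abs(S)` over time gives the usual spectrogram view; here the peak bin of the first segment recovers the 1 kHz tone.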

The decoder uses an overlap-add strategy: each frame in the bitstream contains parameters for 32 ms, but the next frame starts halfway through the current one. Because the synthesized segments are shaped with a Hann window, adding two overlapping frames together produces a smooth transition between them. The same window is useful on the encoder side, since the short-time Fourier transform gives better results when the input data is pre-windowed with a Hann window.
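The overlap-add step can be verified numerically: with a periodic Hann window and 50% overlap, the shifted windows sum to exactly one, so the overlapping frames reconstruct the signal smoothly. This is a sketch of the principle, not the HILN decoder itself:

```python
import numpy as np

N = 512                                     # 32 ms frame at 16 kHz
hop = N // 2                                # next frame starts halfway through
# periodic Hann window: shifted copies at 50 % overlap sum to exactly 1
win = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))

# stand-ins for two consecutive synthesized 32 ms segments
t = np.arange(2 * N) / 16000
signal = np.sin(2 * np.pi * 440 * t)
frames = [signal[0:N], signal[hop:hop + N]]

out = np.zeros(hop + N)
for i, frame in enumerate(frames):
    out[i * hop : i * hop + N] += win * frame   # window, then add the overlap

# in the fully overlapped region the two windows sum to 1,
# so the signal is reconstructed there up to rounding error
err = float(np.max(np.abs(out[hop:N] - signal[hop:N])))
```

If the two frames' parameters differ slightly, the cross-fade implied by the windows hides the discontinuity instead of producing a click at the frame boundary.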

Synthesizing only the sinusoids sounds artificial and metallic. To mask this, the encoder subtracts the synthesized sinusoids from the original audio signal. The residual is then matched to a linear filter that is excited with white noise. The extracted parameters can then be quantized, coded and multiplexed into a bitstream.
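One common way to realize this noise model is linear-predictive (all-pole) modelling of the residual; the sketch below assumes the autocorrelation method, with illustrative model order and test data:

```python
import numpy as np

def lpc(residual, order=8):
    """Fit an all-pole filter to the residual (autocorrelation method)."""
    r = np.correlate(residual, residual, mode="full")[len(residual) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return a        # predictor coefficients: x[n] ~ sum a[k] * x[n-1-k]

rng = np.random.default_rng(1)
# toy "residual": white noise coloured by a one-pole lowpass (coefficient 0.9)
resid = rng.standard_normal(4096)
for n in range(1, len(resid)):
    resid[n] += 0.9 * resid[n - 1]

a = lpc(resid, order=2)      # should recover roughly [0.9, 0]

# decoder side: excite the fitted all-pole filter with fresh white noise
noise = rng.standard_normal(4096)
synth = np.zeros_like(noise)
for n in range(len(noise)):
    synth[n] = noise[n] + sum(a[k] * synth[n - 1 - k]
                              for k in range(len(a)) if n - 1 - k >= 0)
```

The decoder never sees the residual itself, only the filter coefficients and a gain; the synthesized noise matches the residual's spectral envelope, not its waveform.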

Audio filter

An audio filter is a frequency-dependent amplifier circuit, working in the audio frequency range, 0 Hz to beyond 20 kHz. Audio filters can amplify (boost), pass or attenuate (cut) some frequency ranges. Many types of filters exist for different audio applications, including hi-fi stereo systems, musical synthesizers, sound effects, sound reinforcement systems, instrument amplifiers and virtual reality systems.

White noise

In signal processing, white noise is a random signal having equal intensity at different frequencies, giving it a constant power spectral density. The term is used, with this or similar meanings, in many scientific and technical disciplines, including physics, acoustical engineering, telecommunications, and statistical forecasting. White noise refers to a statistical model for signals and signal sources, rather than to any specific signal. White noise draws its name from white light, although light that appears white generally does not have a flat power spectral density over the visible band.
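The flat power-spectral-density property can be checked numerically on generated noise: averaging the periodogram over coarse frequency bands shows each band carrying about the same power.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(1 << 16)            # 65536 white-noise samples

psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)  # periodogram estimate
# average over 8 coarse bands: for white noise every band should
# carry roughly the same power
bands = psd[1:32769].reshape(8, -1).mean(axis=1)
spread = float(bands.max() / bands.min())   # close to 1 for white noise
```

A single periodogram bin fluctuates wildly, which is why the check averages many bins per band; only the band averages are expected to be flat.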

A bitstream, also known as binary sequence, is a sequence of bits.

Related Research Articles

Additive synthesis is a sound synthesis technique that creates timbre by adding sine waves together.

MPEG-1 is a standard for lossy compression of video and audio. It is designed to compress VHS-quality raw digital video and CD audio down to 1.5 Mbit/s without excessive quality loss, making video CDs, digital cable/satellite TV and digital audio broadcasting (DAB) possible.

Subtractive synthesis is a method of sound synthesis in which partials of an audio signal are attenuated by a filter to alter the timbre of the sound. While subtractive synthesis can be applied to any source audio signal, the sound most commonly associated with the technique is that of analog synthesizers of the 1960s and 1970s, in which the harmonics of simple waveforms such as sawtooth, pulse or square waves are attenuated with a voltage-controlled resonant low-pass filter. Many digital, virtual analog and software synthesizers use subtractive synthesis, sometimes in conjunction with other methods of sound synthesis.

The amplitude of a periodic variable is a measure of its change over a single period. There are various definitions of amplitude, which are all functions of the magnitude of the difference between the variable's extreme values. In older texts the phase is sometimes called the amplitude.

Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch.

Aliasing

In signal processing and related disciplines, aliasing is an effect that causes different signals to become indistinguishable when sampled. It also refers to the distortion or artifact that results when the signal reconstructed from samples is different from the original continuous signal.
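A concrete example: sampled at 1 kHz, a 900 Hz tone and a 100 Hz tone yield identical samples, because 900 Hz lies above the 500 Hz Nyquist frequency and aliases down to |900 − 1000| = 100 Hz.

```python
import numpy as np

fs = 1000                                   # sampling rate in Hz
n = np.arange(100)
x_100 = np.cos(2 * np.pi * 100 * n / fs)    # 100 Hz tone
x_900 = np.cos(2 * np.pi * 900 * n / fs)    # 900 Hz tone, above Nyquist

# the two sampled sequences are indistinguishable
max_diff = float(np.max(np.abs(x_100 - x_900)))
```

No processing after sampling can tell the two tones apart, which is why anti-alias filtering must happen before the sampler.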

Window function

In signal processing and statistics, a window function is a mathematical function that is zero-valued outside of some chosen interval, normally symmetric around the middle of the interval, usually near a maximum in the middle, and usually tapering away from the middle. Mathematically, when another function or waveform/data-sequence is "multiplied" by a window function, the product is also zero-valued outside the interval: all that is left is the part where they overlap, the "view through the window". Equivalently, and in actual practice, the segment of data within the window is first isolated, and then only that data is multiplied by the window function values. Thus, tapering, not segmentation, is the main purpose of window functions.
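For example, multiplying a data segment by a Hann window tapers it to zero at the interval edges while leaving the middle nearly untouched:

```python
import numpy as np

N = 64
w = np.hanning(N)              # Hann window: zero at the edges, maximum mid-interval

data = np.ones(N)              # a segment of all-ones "data"
view = data * w                # the "view through the window"

edge = float(max(view[0], view[-1]))   # tapered to zero at the edges
peak = float(view.max())               # close to 1 in the middle
```

The tapering is what suppresses the spectral leakage that an abrupt cut at the segment boundary would otherwise cause.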

Spectrum analyzer

A spectrum analyzer measures the magnitude of an input signal versus frequency within the full frequency range of the instrument. The primary use is to measure the power of the spectrum of known and unknown signals. The input signal that a spectrum analyzer measures is electrical; however, spectral compositions of other signals, such as acoustic pressure waves and optical light waves, can be considered through the use of an appropriate transducer. Optical spectrum analyzers also exist, which use direct optical techniques such as a monochromator to make measurements.

The Adaptive Multi-Rate (AMR) audio codec is an audio compression format optimized for speech coding. It is a multi-rate narrowband speech codec that encodes narrowband (200–3400 Hz) signals at variable bit rates ranging from 4.75 to 12.2 kbit/s, with toll-quality speech starting at 7.4 kbit/s.

Harmonic Vector Excitation Coding (HVXC) is a speech coding algorithm specified in the MPEG-4 Part 3 standard for very low bit rate speech coding. HVXC supports bit rates of 2 and 4 kbit/s in fixed and variable bit rate modes at a sampling frequency of 8 kHz. It can also operate at lower bit rates, such as 1.2–1.7 kbit/s, using a variable bit rate technique. The total algorithmic delay of the encoder and decoder is 36 ms.

A phase vocoder is a type of vocoder which can scale both the frequency and time domains of audio signals by using phase information. The computer algorithm allows frequency-domain modifications to a digital sound file.

Dolby Digital Plus, also known as Enhanced AC-3, is a digital audio compression scheme developed by Dolby Labs for transport and storage of multi-channel digital audio. It is a successor to Dolby Digital (AC-3), also developed by Dolby, and has a number of improvements including support for a wider range of data rates, increased channel count and multi-program support, and additional tools (algorithms) for representing compressed data and counteracting artifacts. While Dolby Digital (AC-3) supports up to five full-bandwidth audio channels at a maximum bitrate of 640 kbit/s, E-AC-3 supports up to 15 full-bandwidth audio channels at a maximum bitrate of 6.144 Mbit/s.

Spectrum continuation analysis (SCA) is a generalization of the concept of Fourier series to non-periodic functions of which only a fragment has been sampled in the time domain.

Bandwidth extension of a signal is the deliberate process of expanding the frequency range (bandwidth) in which the signal contains appreciable and useful content. Significant advances in recent years have led to the technology being adopted commercially in several areas, including psychoacoustic bass enhancement of small loudspeakers and the high-frequency enhancement of coded speech and audio.

In statistical signal processing, the goal of spectral density estimation (SDE) is to estimate the spectral density of a random signal from a sequence of time samples of the signal. Intuitively speaking, the spectral density characterizes the frequency content of the signal. One purpose of estimating the spectral density is to detect any periodicities in the data, by observing peaks at the frequencies corresponding to these periodicities.
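A sketch of the simplest such estimator, the periodogram, detecting a hidden periodicity as a spectral peak:

```python
import numpy as np

fs = 256
t = np.arange(4 * fs) / fs
rng = np.random.default_rng(7)
# noisy signal with a hidden 8 Hz periodicity
x = np.sin(2 * np.pi * 8 * t) + 0.5 * rng.standard_normal(len(t))

psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)      # periodogram estimate
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
detected = float(freqs[np.argmax(psd[1:]) + 1])  # skip the DC bin
```

The periodicity is invisible in the raw samples at this noise level but stands out clearly as the largest peak in the estimated spectral density.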

G.718 is an ITU-T recommendation for an embedded scalable speech and audio codec providing high-quality narrowband speech at the lower bit rates and high-quality wideband speech over the complete range of bit rates. In addition, G.718 is designed to be highly robust to frame erasures, enhancing speech quality when used in Internet Protocol (IP) transport applications on fixed, wireless and mobile networks. Despite its embedded nature, the codec also performs well with both narrowband and wideband generic audio signals. Its embedded scalable structure enables maximum flexibility in the transport of voice packets through today's IP networks and future media-aware networks, and allows the codec to be extended with superwideband and stereo capability through additional layers, which were under development in ITU-T Study Group 16. The bitstream may be truncated at the decoder side, or by any component of the communication system, to instantaneously adjust the bit rate to the desired value without out-of-band signalling. The encoder produces an embedded bitstream structured in five layers corresponding to the five available bit rates: 8, 12, 16, 24 and 32 kbit/s.

Constrained Energy Lapped Transform (CELT) is an open, royalty-free lossy audio compression format and a free software codec with especially low algorithmic delay for use in low-latency audio communication. The algorithms are openly documented and may be used free of software patent restrictions. Development of the format was maintained by the Xiph.Org Foundation and later coordinated by the Opus working group of the Internet Engineering Task Force (IETF).

Harmonic pitch class profiles (HPCP) is a group of features that a computer program extracts from an audio signal, based on a pitch class profile—a descriptor proposed in the context of a chord recognition system. HPCP are an enhanced pitch distribution feature that are sequences of feature vectors that, to a certain extent, describe tonality, measuring the relative intensity of each of the 12 pitch classes of the equal-tempered scale within an analysis frame. Often, the twelve pitch spelling attributes are also referred to as chroma and the HPCP features are closely related to what is called chroma features or chromagrams.

Opus (audio format)

Opus is a lossy audio coding format developed by the Xiph.Org Foundation and standardized by the Internet Engineering Task Force, designed to efficiently code speech and general audio in a single format, while remaining low-latency enough for real-time interactive communication and low-complexity enough for low-end embedded processors. Opus replaces both Vorbis and Speex for new applications, and several blind listening tests have ranked it higher-quality than any other standard audio format at any given bitrate until transparency is reached, including MP3, AAC, and HE-AAC.

Codec 2 is a low-bitrate speech audio codec that is patent-free and open source. Codec 2 compresses speech using sinusoidal coding, a method specialized for human speech. It operates at bit rates from 450 to 3200 bit/s. Codec 2 was designed for amateur radio and other high-compression voice applications.