Spectral modeling synthesis

[Figure: Spectral modeling synthesis overview (based on Roads 1996, p. 153, Fig. 4.23)]

Spectral modeling synthesis (SMS) is an acoustic modeling approach for speech and other signals. SMS models a sound as a combination of harmonic content and noise content. Harmonic components are identified from peaks in the frequency spectrum of the signal, normally as found by the short-time Fourier transform. The signal that remains after the detected harmonic components are removed, often called the residual, is then modeled as white noise passed through a time-varying filter. The outputs of the model are therefore the frequencies and levels of the detected harmonic components together with the coefficients of the time-varying filter.
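
A minimal single-frame sketch of this analysis in Python with NumPy/SciPy follows. The number of partials kept, the width of the spectral region removed around each peak, and the moving-average envelope are illustrative assumptions, not details of any published SMS implementation.

    import numpy as np
    from scipy.signal import find_peaks, get_window

    def sms_analyze_frame(frame, fs, n_fft=2048, n_partials=20):
        """One SMS analysis frame: pick spectral peaks (harmonic part),
        remove them, and model the residual by a smoothed magnitude
        spectrum standing in for the time-varying filter's response."""
        window = get_window("hann", len(frame))
        spectrum = np.fft.rfft(frame * window, n_fft)
        mag = np.abs(spectrum)

        # Harmonic part: the strongest local maxima of the magnitude spectrum.
        peaks, _ = find_peaks(mag)
        peaks = peaks[np.argsort(mag[peaks])[::-1][:n_partials]]
        freqs = peaks * fs / n_fft               # peak frequencies in Hz
        levels = mag[peaks]                      # peak levels

        # Residual: crudely zero the bins around each detected peak.
        residual = spectrum.copy()
        for p in peaks:
            residual[max(p - 2, 0):p + 3] = 0.0

        # Filter description: a coarse spectral envelope of the residual,
        # here a simple moving average of its magnitude spectrum.
        envelope = np.convolve(np.abs(residual), np.ones(32) / 32, mode="same")
        return freqs, levels, envelope

Running this over successive overlapping frames yields the partial tracks and filter envelopes that the synthesis stage would consume.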

Intuitively, the model can be applied to many types of audio signals. Speech signals, for example, include slowly changing harmonic sounds caused by vibration of the vocal cords plus wideband, noise-like sounds caused by the lips and mouth. Musical instruments also produce sounds containing both harmonic components and percussive, noise-like sounds when the notes are struck or changed.

[Figures: SMS analysis and synthesis block diagrams (based on Bonada et al. 2001, Fig. 1 & Fig. 2)]


Related Research Articles

Additive synthesis is a sound synthesis technique that creates timbre by adding sine waves together.
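
As a sketch, assuming NumPy: a handful of partials summed sample by sample (the frequencies and amplitudes below are arbitrary).

    import numpy as np

    def additive(freqs, amps, dur=1.0, fs=44100):
        """Build a timbre by summing sine waves (partials)."""
        t = np.arange(int(dur * fs)) / fs
        return sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs, amps))

    # The first four harmonics of 220 Hz with decaying amplitudes.
    tone = additive([220, 440, 660, 880], [1.0, 0.5, 0.33, 0.25])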

Digital signal processing (DSP) is the use of digital processing, such as by computers or more specialized digital signal processors, to perform a wide variety of signal processing operations. The digital signals processed in this manner are sequences of numbers that represent samples of a continuous variable in a domain such as time, space, or frequency. In digital electronics, a digital signal is represented as a pulse train, which is typically generated by the switching of a transistor.

<span class="mw-page-title-main">Formant</span> Spectrum of phonetic resonance in speech production, or its peak

In speech science and phonetics, a formant is the broad spectral maximum that results from an acoustic resonance of the human vocal tract. In acoustics, a formant is usually defined as a broad peak, or local maximum, in the spectrum. For harmonic sounds, with this definition, the formant frequency is sometimes taken as that of the harmonic that is most augmented by a resonance. The difference between these two definitions resides in whether "formants" characterise the production mechanisms of a sound or the produced sound itself. In practice, the frequency of a spectral peak differs slightly from the associated resonance frequency, except when, by luck, harmonics are aligned with the resonance frequency.

<span class="mw-page-title-main">Frequency modulation synthesis</span> Form of sound synthesis

Frequency modulation synthesis is a form of sound synthesis whereby the frequency of a carrier waveform is changed by modulating it with a second, modulator signal. The frequency of an oscillator is altered "in accordance with the amplitude of a modulating signal".
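
The classic two-operator form can be written y(t) = sin(2π f_c t + I sin(2π f_m t)), where f_c is the carrier frequency, f_m the modulator frequency, and I the modulation index. A minimal NumPy sketch with arbitrary parameter values:

    import numpy as np

    def fm_tone(fc, fmod, index, dur=1.0, fs=44100):
        """Two-operator FM: the carrier's phase is driven by a sinusoid
        at fmod; the index controls the strength of the sidebands."""
        t = np.arange(int(dur * fs)) / fs
        return np.sin(2 * np.pi * fc * t + index * np.sin(2 * np.pi * fmod * t))

    # A non-integer carrier/modulator ratio gives an inharmonic, bell-like tone.
    bell = fm_tone(fc=200.0, fmod=280.0, index=5.0)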

<span class="mw-page-title-main">Signal processing</span> Analysing, modifying and creating signals

Signal processing is an electrical engineering subfield that focuses on analyzing, modifying and synthesizing signals, such as sound, images, potential fields, seismic signals, altimetry data, and scientific measurements. Signal processing techniques are used to optimize transmissions, improve digital storage efficiency and subjective video quality, correct distorted signals, and detect or pinpoint components of interest in a measured signal.

<span class="mw-page-title-main">Vocoder</span> Voice encryption, transformation, and synthesis device

A vocoder is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation.

Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive model.
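
A compact sketch of the autocorrelation method, assuming NumPy/SciPy; the model order is an arbitrary choice here.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc(signal, order=12):
        """Solve the Toeplitz normal equations for coefficients a such
        that s[n] is approximated by sum_k a[k] * s[n - k]."""
        s = signal - np.mean(signal)
        r = np.correlate(s, s, mode="full")[len(s) - 1:]   # autocorrelation
        return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

The frequency response 1/|A(e^{jω})| of the resulting all-pole filter, with A(z) = 1 − Σ a_k z^{−k}, traces the spectral envelope of the analyzed frame.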

In signal processing, distortion is the alteration of the original shape of a signal. In communications and electronics it means the alteration of the waveform of an information-bearing signal, such as an audio signal representing sound or a video signal representing images, in an electronic device or communication channel.

Flanging is an audio effect produced by mixing two identical signals together, one of them delayed by a small and usually gradually changing period, typically smaller than 20 milliseconds. This produces a swept comb-filter effect: peaks and notches appear in the resulting frequency spectrum, related to each other in a linear harmonic series. Varying the time delay causes these to sweep up and down the frequency spectrum. A flanger is an effects unit that creates this effect.
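
A rough NumPy sketch of this structure; the delay depth, sweep rate, and linear-interpolation fractional delay are all illustrative choices.

    import numpy as np

    def flanger(x, fs, max_delay_ms=5.0, rate_hz=0.25, depth=1.0):
        """Mix the signal with a copy whose delay sweeps slowly up and
        down, producing a moving comb-filter effect."""
        n = np.arange(len(x))
        delay = 0.5 * max_delay_ms * 1e-3 * fs * (1 + np.sin(2 * np.pi * rate_hz * n / fs))
        idx = n - delay
        lo = np.clip(np.floor(idx).astype(int), 0, len(x) - 1)
        hi = np.clip(lo + 1, 0, len(x) - 1)
        frac = idx - np.floor(idx)
        delayed = (1 - frac) * x[lo] + frac * x[hi]   # fractional delay
        return x + depth * delayed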

<span class="mw-page-title-main">Spectrogram</span> Visual representation of the spectrum of frequencies of a signal as it varies with time

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams. When the data are represented in a 3D plot they may be called waterfall displays.
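
For example, with SciPy's spectrogram function and an arbitrary test chirp:

    import numpy as np
    from scipy.signal import spectrogram

    fs = 8000
    t = np.arange(2 * fs) / fs
    x = np.sin(2 * np.pi * (500 + 200 * t) * t)      # upward-sweeping chirp

    # Each column of Sxx is a short-time power spectrum; time runs
    # along the second axis, frequency along the first.
    f, times, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=192)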

<span class="mw-page-title-main">Spectrum analyzer</span> Electronic testing device

A spectrum analyzer measures the magnitude of an input signal versus frequency within the full frequency range of the instrument. The primary use is to measure the power of the spectrum of known and unknown signals. The input signal that most common spectrum analyzers measure is electrical; however, spectral compositions of other signals, such as acoustic pressure waves and optical light waves, can be measured through the use of an appropriate transducer. Spectrum analyzers for other types of signals also exist, such as optical spectrum analyzers, which use direct optical techniques such as a monochromator to make measurements.

<span class="mw-page-title-main">Frequency domain</span> Signal representation

In physics, electronics, control systems engineering, and statistics, the frequency domain refers to the analysis of mathematical functions or signals with respect to frequency, rather than time. Put simply, a time-domain graph shows how a signal changes over time, whereas a frequency-domain graph shows how much of the signal lies within each given frequency band over a range of frequencies. A frequency-domain representation can also include information on the phase shift that must be applied to each sinusoid in order to be able to recombine the frequency components to recover the original time signal.
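
A tiny NumPy round trip illustrates the last point: keeping both magnitude and phase lets the original time signal be recovered.

    import numpy as np

    x = np.random.randn(1024)          # a time-domain signal
    X = np.fft.rfft(x)                 # its frequency-domain representation

    magnitude, phase = np.abs(X), np.angle(X)
    # Both parts are needed to get back to the time domain.
    x_back = np.fft.irfft(magnitude * np.exp(1j * phase), n=len(x))
    assert np.allclose(x, x_back)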

<span class="mw-page-title-main">Colors of noise</span> Power spectrum of a noise signal (a signal produced by a stochastic process)

In audio engineering, electronics, physics, and many other fields, the color of noise or noise spectrum refers to the power spectrum of a noise signal. Different colors of noise have significantly different properties. For example, as audio signals they will sound different to human ears, and as images they will have a visibly different texture. Therefore, each application typically requires noise of a specific color. This sense of 'color' for noise signals is similar to the concept of timbre in music.
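
One common way to generate a given color is to shape the spectrum of white noise; a NumPy sketch, where the exponent sets the assumed power-law slope:

    import numpy as np

    def colored_noise(n, exponent):
        """Shape white noise so its power spectrum falls off roughly as
        1 / f**exponent (0 = white, 1 = pink, 2 = brown)."""
        X = np.fft.rfft(np.random.randn(n))
        f = np.fft.rfftfreq(n)
        f[0] = f[1]                        # avoid dividing by zero at DC
        X *= f ** (-exponent / 2)          # amplitude slope is half the power slope
        return np.fft.irfft(X, n=n)

    pink = colored_noise(44100, exponent=1.0)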

The source–filter model represents speech as a combination of a sound source, such as the vocal cords, and a linear acoustic filter, the vocal tract. While only an approximation, the model is widely used in a number of applications such as speech synthesis and speech analysis because of its relative simplicity. It is also related to linear prediction. The development of the model is due, in large part, to the early work of Gunnar Fant, although others, notably Ken Stevens, have also contributed substantially to the models underlying acoustic analysis of speech and speech synthesis. Fant built on the work of Tsutomu Chiba and Masato Kajiyama, who first showed the relationship between a vowel's acoustic properties and the shape of the vocal tract.
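
A minimal sketch of the idea, assuming NumPy/SciPy: a pulse-train source filtered by a cascade of two-pole resonators standing in for formants. The formant frequencies and bandwidths below are rough, illustrative vowel values.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000
    source = np.zeros(fs // 2)           # half a second of glottal-like pulses
    source[::fs // 120] = 1.0            # roughly 120 Hz pitch

    # Each formant becomes a two-pole resonator in the filter denominator.
    a = np.array([1.0])
    for f0, bw in [(700, 130), (1220, 70)]:
        r = np.exp(-np.pi * bw / fs)
        theta = 2 * np.pi * f0 / fs
        a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])

    vowel = lfilter([1.0], a, source)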

Bandwidth extension of a signal is the deliberate process of expanding the frequency range (bandwidth) in which the signal contains appreciable and useful content, and/or the frequency range over which its effects apply. Significant advancement in recent years has led to the technology being adopted commercially in several areas, including psychoacoustic bass enhancement of small loudspeakers and high-frequency enhancement of coded speech and audio.

Computer audition (CA) or machine listening is the general field of study of algorithms and systems for audio understanding by machine. Since the notion of what it means for a machine to "hear" is very broad and somewhat vague, computer audition attempts to bring together several disciplines that originally dealt with specific problems or had a concrete application in mind. The engineer Paris Smaragdis, interviewed in Technology Review, describes these systems as "software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents."

The method of reassignment is a technique for sharpening a time-frequency representation by mapping the data to time-frequency coordinates that are nearer to the true region of support of the analyzed signal. The method has been independently introduced by several parties under various names, including method of reassignment, remapping, time-frequency reassignment, and modified moving-window method. In the case of the spectrogram or the short-time Fourier transform, the method of reassignment sharpens blurry time-frequency data by relocating the data according to local estimates of instantaneous frequency and group delay. This mapping to reassigned time-frequency coordinates is very precise for signals that are separable in time and frequency with respect to the analysis window.
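
A single-frame sketch of these local estimates, assuming NumPy and one commonly used pair of conventions, ω̂ = ω − Im(X_dh/X_h) and t̂ = t + Re(X_th/X_h); sign conventions vary between references.

    import numpy as np

    def reassigned_frame(frame, fs, n_fft=1024):
        """Reassigned coordinates for one STFT frame: each bin's energy
        is relocated to local estimates of instantaneous frequency and
        time offset, computed from two auxiliary windowed transforms."""
        n = len(frame)
        h = np.hanning(n)
        dh = np.gradient(h) * fs                # time derivative of the window
        th = h * (np.arange(n) - n / 2) / fs    # time-ramped window, seconds

        Xh = np.fft.rfft(frame * h, n_fft)
        Xdh = np.fft.rfft(frame * dh, n_fft)
        Xth = np.fft.rfft(frame * th, n_fft)

        eps = np.finfo(float).eps
        omega = np.fft.rfftfreq(n_fft, 1 / fs)             # bin frequencies, Hz
        freq_hat = omega - np.imag(Xdh / (Xh + eps)) / (2 * np.pi)
        time_hat = np.real(Xth / (Xh + eps))    # offset from frame center, s
        return freq_hat, time_hat, np.abs(Xh) ** 2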

A frequency synthesizer is an electronic circuit that generates a range of frequencies from a single reference frequency. Frequency synthesizers are used in many modern devices such as radio receivers, televisions, mobile telephones, radiotelephones, walkie-talkies, CB radios, cable television converter boxes, satellite receivers, and GPS systems. A frequency synthesizer may use the techniques of frequency multiplication, frequency division, direct digital synthesis, frequency mixing, and phase-locked loops to generate its frequencies. The stability and accuracy of the frequency synthesizer's output are related to the stability and accuracy of its reference frequency input. Consequently, synthesizers use stable and accurate reference frequencies, such as those provided by a crystal oscillator.
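
As an illustration of the direct digital synthesis technique mentioned above, a NumPy sketch of a phase accumulator driving a sine lookup table; the register widths are arbitrary.

    import numpy as np

    def dds(f_out, f_clock, n_samples, acc_bits=32, table_bits=10):
        """A phase accumulator advances by a fixed tuning word each
        clock tick; its top bits index a sine table, so that
        f_out = tuning_word * f_clock / 2**acc_bits."""
        table = np.sin(2 * np.pi * np.arange(2 ** table_bits) / 2 ** table_bits)
        tuning_word = int(f_out * 2 ** acc_bits / f_clock)
        acc = (np.arange(n_samples, dtype=np.uint64) * tuning_word) % (2 ** acc_bits)
        return table[(acc >> (acc_bits - table_bits)).astype(int)]

    wave = dds(f_out=1000.0, f_clock=48000.0, n_samples=480)   # ~1 kHz sine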

In statistical signal processing, the goal of spectral density estimation (SDE) or simply spectral estimation is to estimate the spectral density of a signal from a sequence of time samples of the signal. Intuitively speaking, the spectral density characterizes the frequency content of the signal. One purpose of estimating the spectral density is to detect any periodicities in the data, by observing peaks at the frequencies corresponding to these periodicities.
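
For instance, Welch's method averages periodograms of overlapping segments; a SciPy sketch with a synthetic 60 Hz tone buried in noise:

    import numpy as np
    from scipy.signal import welch

    fs = 1000
    t = np.arange(5 * fs) / fs
    x = np.sin(2 * np.pi * 60 * t) + 0.5 * np.random.randn(len(t))

    # Averaging periodograms of overlapping segments reduces variance.
    f, Pxx = welch(x, fs=fs, nperseg=1024)
    print(f[np.argmax(Pxx)])     # the peak lands near 60 Hz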

Harmonic pitch class profiles (HPCP) are a group of features that a computer program extracts from an audio signal, based on a pitch class profile, a descriptor proposed in the context of a chord recognition system. HPCP are an enhanced pitch distribution feature consisting of sequences of feature vectors that, to a certain extent, describe tonality, measuring the relative intensity of each of the 12 pitch classes of the equal-tempered scale within an analysis frame. The twelve pitch spelling attributes are often also referred to as chroma, and HPCP features are closely related to what are called chroma features or chromagrams.
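
A simplified chroma-style sketch of the underlying idea, assuming NumPy; real HPCP extraction adds peak interpolation and weighting schemes omitted here, and the frequency limits and A4 reference below are arbitrary.

    import numpy as np

    def pitch_class_profile(frame, fs, n_fft=4096, f_ref=440.0):
        """Fold FFT bin energies onto the 12 equal-tempered pitch
        classes (index 0 corresponds to A at the chosen reference)."""
        mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
        freqs = np.fft.rfftfreq(n_fft, 1 / fs)
        valid = (freqs > 27.5) & (freqs < 5000)
        classes = np.round(12 * np.log2(freqs[valid] / f_ref)).astype(int) % 12
        pcp = np.zeros(12)
        np.add.at(pcp, classes, mag[valid])
        return pcp / (pcp.max() + np.finfo(float).eps)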
