Audio time stretching and pitch scaling

Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. Pitch scaling is the opposite: the process of changing the pitch without affecting the speed. Pitch shift is pitch scaling implemented in an effects unit and intended for live performance. Pitch control is a simpler process which affects pitch and speed simultaneously by slowing down or speeding up a recording.

These processes are often used to match the pitches and tempos of two pre-recorded clips for mixing when the clips cannot be reperformed or resampled. Time stretching is often used to adjust radio commercials [1] and the audio of television advertisements [2] to fit exactly into the 30 or 60 seconds available. It can be used to conform longer material to a designated time slot, such as a 1-hour broadcast.

Resampling

The simplest way to change the duration or pitch of an audio recording is to change the playback speed. For a digital audio recording, this can be accomplished through sample rate conversion. With this method, the frequencies in the recording are always scaled by the same ratio as the speed, transposing the perceived pitch up or down in the process. Slowing down the recording to increase its duration also lowers the pitch, while speeding it up to shorten it raises the pitch, creating the so-called Chipmunk effect. When resampling audio to a notably lower pitch, a source recorded at a higher sample rate may be preferred, because slowing playback also scales the recording's highest frequencies downward, leaving the upper part of the audible band empty and reducing the perceived clarity of the sound. Conversely, when resampling audio to a notably higher pitch, a low-pass (anti-aliasing) filter should be applied, because frequencies that are pushed above the Nyquist frequency (determined by the sampling rate of the audio reproduction software or device) fold back into the audible band as distortion that is usually undesired, a phenomenon known as aliasing.
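
The coupling between speed and pitch can be illustrated with a minimal sketch. The function names, the 8 kHz sample rate, and the 200 Hz test tone are assumptions chosen for illustration, and linear interpolation stands in for a proper sample rate converter; no anti-aliasing filter is applied, so this is safe only for material well below the Nyquist frequency.

```python
import math

def resample_linear(samples, speed):
    """Read `samples` at `speed` times the original rate (linear interpolation)."""
    n_out = int(len(samples) / speed)
    out = []
    for i in range(n_out):
        pos = i * speed                      # fractional read position in the input
        k = int(pos)
        frac = pos - k
        nxt = samples[min(k + 1, len(samples) - 1)]
        out.append((1.0 - frac) * samples[k] + frac * nxt)
    return out

def upward_zero_crossings(samples):
    """Count negative-to-non-negative transitions (one per cycle of a sine)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)

sr = 8000                                    # assumed sample rate in Hz
tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(sr)]  # 1 s at 200 Hz
fast = resample_linear(tone, 2.0)            # double speed

# Same number of cycles, half the samples: half the duration, twice the pitch.
print(len(tone), upward_zero_crossings(tone))
print(len(fast), upward_zero_crossings(fast))
```

Played back at the original sample rate, `fast` lasts half as long as `tone` while containing the same number of cycles, so its frequency is doubled.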

Frequency domain

Phase vocoder

One way of stretching the length of a signal without affecting the pitch is to build a phase vocoder after Flanagan, Golden, and Portnoff.

Basic steps:

  1. compute the instantaneous frequency/amplitude relationship of the signal using the STFT, which is the discrete Fourier transform of a short, overlapping and smoothly windowed block of samples;
  2. apply some processing to the Fourier transform magnitudes and phases (like resampling the FFT blocks); and
  3. perform an inverse STFT by taking the inverse Fourier transform on each chunk and adding the resulting waveform chunks, also called overlap and add (OLA). [3]
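
The three steps above can be sketched with NumPy as follows. This is a basic textbook-style phase vocoder under stated assumptions, not a reconstruction of any particular published implementation: the function name, the Hann window, the FFT size, and the hop sizes are all illustrative choices, and no transient handling or phase locking is attempted.

```python
import numpy as np

def phase_vocoder_stretch(x, stretch, n_fft=1024, hop_s=256):
    """Time-stretch x by `stretch` (> 1 means longer) without changing pitch."""
    hop_a = max(1, int(round(hop_s / stretch)))        # analysis hop
    window = np.hanning(n_fft)
    # Centre frequency of each bin in radians per sample.
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
    n_frames = 1 + (len(x) - n_fft) // hop_a
    out = np.zeros(n_fft + hop_s * (n_frames - 1))
    norm = np.zeros_like(out)
    prev_phase = synth_phase = None
    for m in range(n_frames):
        # Step 1: STFT of one windowed block.
        spec = np.fft.rfft(x[m * hop_a : m * hop_a + n_fft] * window)
        mag, phase = np.abs(spec), np.angle(spec)
        # Step 2: estimate instantaneous frequency, re-accumulate phase
        # for the synthesis hop.
        if m == 0:
            synth_phase = phase.copy()
        else:
            delta = phase - prev_phase - omega * hop_a
            delta = (delta + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
            synth_phase = synth_phase + (omega + delta / hop_a) * hop_s
        prev_phase = phase
        # Step 3: inverse transform and overlap-add.
        frame = np.fft.irfft(mag * np.exp(1j * synth_phase), n=n_fft) * window
        out[m * hop_s : m * hop_s + n_fft] += frame
        norm[m * hop_s : m * hop_s + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

sr = 8000                                              # assumed sample rate in Hz
x = np.sin(2 * np.pi * 440 * np.arange(2 * sr) / sr)   # 2 s test tone at 440 Hz
y = phase_vocoder_stretch(x, 2.0)
peak_hz = np.argmax(np.abs(np.fft.rfft(y))) * sr / len(y)
print(len(y) / len(x), peak_hz)
```

The output is roughly twice as long as the input while its spectral peak stays near the original 440 Hz. The output is slightly shorter than an exact doubling because trailing samples that do not fill a whole frame are dropped.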

The phase vocoder handles sinusoid components well, but early implementations introduced considerable smearing on transient ("beat") waveforms at all non-integer compression/expansion rates, which renders the results phasey and diffuse. Recent improvements allow better quality results at all compression/expansion ratios but a residual smearing effect still remains.

The phase vocoder technique can also be used to perform pitch shifting, chorusing, timbre manipulation, harmonizing, and other unusual modifications, all of which can be changed as a function of time.

Sinusoidal analysis/synthesis system (based on McAulay & Quatieri 1988, p. 161) [4]

Sinusoidal spectral modeling

Another method for time stretching relies on a spectral model of the signal. In this method, peaks are identified in frames using the STFT of the signal, and sinusoidal "tracks" are created by connecting peaks in adjacent frames. The tracks are then re-synthesized at a new time scale. This method can yield good results on both polyphonic and percussive material, especially when the signal is separated into sub-bands. However, this method is more computationally demanding than other methods.[ citation needed ]

Modelling a monophonic sound as observation along a helix of a function with a cylinder domain

Time domain

SOLA

Rabiner and Schafer in 1978 put forth an alternate solution that works in the time domain: attempt to find the period (or equivalently the fundamental frequency) of a given section of the wave using some pitch detection algorithm (commonly the peak of the signal's autocorrelation, or sometimes cepstral processing), and crossfade one period into another.
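
The period-estimation step can be sketched as a search for the lag that maximises the autocorrelation. This is a minimal illustration only: the function name, the lag bounds, and the synthetic test frame (a 100 Hz fundamental plus one harmonic at an assumed 8 kHz sample rate) are assumptions, and real implementations typically add normalisation and safeguards against octave errors.

```python
import math

def estimate_period(frame, min_lag, max_lag):
    """Return the lag (in samples) with the largest autocorrelation value."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

sr = 8000                                   # assumed sample rate in Hz
frame = [
    math.sin(2 * math.pi * 100 * n / sr) + 0.5 * math.sin(2 * math.pi * 200 * n / sr)
    for n in range(400)
]
lag = estimate_period(frame, min_lag=20, max_lag=200)
print(lag, sr / lag)                        # period in samples, fundamental in Hz
```

Once the period is known, a time-domain method can repeat or drop whole periods and crossfade at the joins, which is why a wrong period estimate produces audible artifacts.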

This approach is called time-domain harmonic scaling [5] or the synchronized overlap-add method (SOLA). It runs somewhat faster than the phase vocoder on slower machines, but fails when the autocorrelation misestimates the period of a signal with complicated harmonics (such as orchestral pieces).

Adobe Audition (formerly Cool Edit Pro) seems to solve this by looking for the period closest to a center period that the user specifies, which should be an integer multiple of the tempo, and between 30 Hz and the lowest bass frequency.

This is much more limited in scope than the phase vocoder based processing, but can be made much less processor intensive, for real-time applications. It provides the most coherent results[ citation needed ] for single-pitched sounds like voice or musically monophonic instrument recordings.

High-end commercial audio processing packages either combine the two techniques (for example by separating the signal into sinusoid and transient waveforms), or use other techniques based on the wavelet transform, or artificial neural network processing[ citation needed ], producing the highest-quality time stretching.

Frame-based approach

Frame-based approach of many TSM procedures

In order to preserve an audio signal's pitch when stretching or compressing its duration, many time-scale modification (TSM) procedures follow a frame-based approach. [6] Given an original discrete-time audio signal, this strategy's first step is to split the signal into short analysis frames of fixed length. The analysis frames are spaced by a fixed number of samples, called the analysis hopsize $H_a$. To achieve the actual time-scale modification, the analysis frames are then temporally relocated to have a synthesis hopsize $H_s$. This frame relocation results in a modification of the signal's duration by a stretching factor of $\alpha = H_s / H_a$. However, simply superimposing the unmodified analysis frames typically results in undesired artifacts such as phase discontinuities or amplitude fluctuations. To prevent these kinds of artifacts, the analysis frames are adapted to form synthesis frames prior to the reconstruction of the time-scale modified output signal.

The strategy of how to derive the synthesis frames from the analysis frames is a key difference among different TSM procedures.
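
As a baseline, the frame relocation itself can be sketched as plain overlap-add with no frame adaptation at all. The function name, the Hann window, and the frame and hop sizes are assumptions for illustration. Because the frames are superimposed unmodified, this version shows the artifacts described above for any stretching factor other than 1, which is exactly what the adaptation step of a real TSM procedure is meant to fix.

```python
import math

def ola_stretch(x, alpha, frame_len=512, hop_s=256):
    """Relocate Hann-windowed frames from analysis hop H_a to synthesis hop H_s."""
    hop_a = max(1, round(hop_s / alpha))            # H_a = H_s / alpha
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_frames = 1 + (len(x) - frame_len) // hop_a
    out = [0.0] * (frame_len + hop_s * (n_frames - 1))
    norm = [0.0] * len(out)
    for m in range(n_frames):
        a, s = m * hop_a, m * hop_s                 # read and write positions
        for n in range(frame_len):
            out[s + n] += window[n] * x[a + n]
            norm[s + n] += window[n]                # track the summed window gain
    # Divide by the summed window so overlapping frames do not change the level.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]

sr = 8000                                           # assumed sample rate in Hz
x = [math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
stretched = ola_stretch(x, alpha=1.5)
unchanged = ola_stretch(x, alpha=1.0)
print(len(stretched) / len(x))
```

With `alpha=1.0` the two hop sizes coincide and the input is reconstructed; with `alpha=1.5` the output is about one and a half times as long, slightly less in practice because the hop is rounded to whole samples and trailing samples that do not fill a frame are dropped.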

Speed hearing and speed talking

For the specific case of speech, time stretching can be performed using PSOLA.

Time-compressed speech is the representation of verbal text in compressed time. While one might expect speeding up to reduce comprehension, Herb Friedman says that "Experiments have shown that the brain works most efficiently if the information rate through the ears—via speech—is the 'average' reading rate, which is about 200–300 wpm (words per minute), yet the average rate of speech is in the neighborhood of 100–150 wpm." [7]

Listening to time-compressed speech is seen as the equivalent of speed reading.[ by whom? ] [8] [9]

Pitch scaling

Pitch shifting (frequency scaling) is provided on the Eventide Harmonizer.
Frequency shifting, as provided by the Bode Frequency Shifter, does not preserve frequency ratios or harmony.

These techniques can also be used to transpose an audio sample while holding speed or duration constant. This may be accomplished by time stretching and then resampling back to the original length. Alternatively, the frequency of the sinusoids in a sinusoidal model may be altered directly, and the signal reconstructed at the appropriate time scale.

Transposing can be called frequency scaling or pitch shifting , depending on perspective.

For example, one could move the pitch of every note up by a perfect fifth, keeping the tempo the same. One can view this transposition as "pitch shifting", "shifting" each note up 7 keys on a piano keyboard, or adding a fixed amount on the Mel scale, or adding a fixed amount in linear pitch space. One can view the same transposition as "frequency scaling", "scaling" (multiplying) the frequency of every note by 3/2.

Musical transposition preserves the ratios of the harmonic frequencies that determine the sound's timbre, unlike the frequency shift performed by amplitude modulation, which adds a fixed frequency offset to the frequency of every note. (In theory one could perform a literal pitch scaling in which the musical pitch space location is scaled [a higher note would be shifted at a greater interval in linear pitch space than a lower note], but that is highly unusual, and not musical.[ citation needed ])
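
The difference between scaling and shifting can be checked with a few lines of arithmetic. The helper name and the example harmonic series (220, 440, and 660 Hz) are assumptions chosen for illustration. The sketch also shows that seven equal-tempered semitones give a ratio close to, but not exactly, 3/2.

```python
def semitone_ratio(semitones):
    """Frequency ratio for an interval in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)

harmonics = [220.0, 440.0, 660.0]            # a fundamental and two overtones

# Frequency scaling: multiply every partial by the same factor.
scaled = [f * 1.5 for f in harmonics]
# Frequency shifting: add the same offset to every partial.
shifted = [f + 110.0 for f in harmonics]

print(semitone_ratio(7))                     # close to 3/2
print([f / scaled[0] for f in scaled])       # ratios stay 1 : 2 : 3
print([f / shifted[0] for f in shifted])     # ratios are no longer integers
```

Both operations move the fundamental from 220 Hz to 330 Hz, but only scaling keeps the overtones at integer multiples of it, which is why shifting sounds inharmonic.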

Time domain processing works much better here, as smearing is less noticeable, but scaling vocal samples distorts the formants into a sort of Alvin and the Chipmunks-like effect, which may be desirable or undesirable. A process that preserves the formants and character of a voice involves analyzing the signal with a channel vocoder or LPC vocoder plus any of several pitch detection algorithms and then resynthesizing it at a different fundamental frequency.

A detailed description of older analog recording techniques for pitch shifting can be found at Alvin and the Chipmunks § Recording technique.

In consumer software

Pitch-corrected audio timestretch is found in every modern web browser as part of the HTML standard for media playback. [10] Similar controls are ubiquitous in media applications and frameworks such as GStreamer and Unity.

References

  1. "Dolby, The Chipmunks And NAB2004". Archived from the original on 2008-05-27.
  2. "Variable speech". www.atarimagazines.com.
  3. Jont B. Allen (June 1977). "Short Time Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform". IEEE Transactions on Acoustics, Speech, and Signal Processing. ASSP-25 (3): 235–238.
  4. McAulay, R. J.; Quatieri, T. F. (1988), "Speech Processing Based on a Sinusoidal Model" (PDF), The Lincoln Laboratory Journal, 1 (2): 153–167, archived from the original (PDF) on 2012-05-21, retrieved 2014-09-07
  5. David Malah (April 1979). "Time-domain algorithms for harmonic bandwidth reduction and time scaling of speech signals". IEEE Transactions on Acoustics, Speech, and Signal Processing. ASSP-27 (2): 121–133.
  6. Jonathan Driedger and Meinard Müller (2016). "A Review of Time-Scale Modification of Music Signals". Applied Sciences. 6 (2): 57. doi:10.3390/app6020057.
  7. Variable Speech, Creative Computing Vol. 9, No. 7 / July 1983 / p. 122
  8. "Listen to podcasts in half the time". Archived from the original on 2011-08-29. Retrieved 2008-07-24.
  9. "Speeding iPods". Archived from the original on 2006-09-02.
  10. "HTMLMediaElement.playbackRate - Web APIs". MDN. Retrieved 1 September 2021.