PSOLA

Oscillograms, spectrograms and intonograms of the Polish expressions (a) "jajem" [egg] (b) "ja jem" [I'm eating] (c) "nawóz" [fertiliser] (d) "na wóz" [on a cart][1]

PSOLA (Pitch-Synchronous Overlap-and-Add) is a digital signal processing technique used in speech processing and, more specifically, in speech synthesis. It can be used to modify the pitch and duration of a speech signal. It was introduced around 1986.[2]


Signal processing is a subfield of electrical engineering that concerns the analysis, synthesis, and modification of signals, which are broadly defined as functions conveying "information about the behavior or attributes of some phenomenon", such as sound, images, and biological measurements. For example, signal processing techniques are used to improve signal transmission fidelity, storage efficiency, and subjective quality, and to emphasize or detect components of interest in a measured signal.

Speech processing is the study of speech signals and of methods for processing them. The signals are usually handled in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Processing speech input is known as speech recognition, and producing speech output as speech synthesis.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.


PSOLA works by dividing the speech waveform into small overlapping segments, each centred on a pitch mark. To change the pitch of the signal, the segments are moved further apart (to decrease the pitch) or closer together (to increase the pitch). To change the duration of the signal, segments are repeated multiple times (to increase the duration) or some are eliminated (to decrease the duration). The segments are then recombined using the overlap-add technique.
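The procedure above can be sketched in a few lines of NumPy. This is a minimal, illustrative time-domain PSOLA implementation, not a faithful reproduction of any published system: the function name `psola` and its parameters are chosen for this example, the pitch marks are assumed to have been found beforehand by a pitch detection algorithm, and refinements such as waveform-similarity alignment and spectral smoothing are omitted.

```python
import numpy as np

def psola(x, marks, pitch_factor=1.0, time_factor=1.0):
    """Naive TD-PSOLA sketch.

    x            -- 1-D speech signal
    marks        -- pitch-mark sample indices (e.g. from a pitch detector)
    pitch_factor -- >1 raises pitch (synthesis marks move closer together)
    time_factor  -- >1 lengthens the signal (segments are reused more often)
    """
    marks = np.asarray(marks)
    periods = np.diff(marks)                  # local pitch periods (samples)
    out_len = int(len(x) * time_factor)
    y = np.zeros(out_len)

    t = marks[0] * time_factor                # first synthesis pitch mark
    while t < out_len:
        # Map the synthesis time back to analysis time; pick nearest mark.
        a = int(np.argmin(np.abs(marks - t / time_factor)))
        P = int(periods[min(a, len(periods) - 1)])
        # Extract a two-period, Hann-windowed segment around the mark.
        seg_start = max(marks[a] - P, 0)
        seg = x[seg_start: marks[a] + P] * np.hanning(marks[a] + P - seg_start)
        # Overlap-add the segment at the current synthesis mark.
        start = int(t) - (marks[a] - seg_start)
        lo, hi = max(start, 0), min(start + len(seg), out_len)
        y[lo:hi] += seg[lo - start: hi - start]
        t += P / pitch_factor                 # smaller step -> higher pitch
    return y
```

On a periodic pulse train with marks at the pulse centres, `pitch_factor=2.0` halves the spacing between output pulses (raising the pitch an octave at unchanged length), while `time_factor=2.0` doubles the output length while preserving the original pulse spacing.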

PSOLA can be used to change the prosody of a speech signal.

In linguistics, prosody is concerned with those elements of speech that are not individual phonetic segments but are properties of syllables and larger units of speech, including linguistic functions such as intonation, tone, stress, and rhythm.

See also

  - Additive synthesis
  - Audio signal processing
  - Subtractive synthesis
  - Speech coding
  - Vocoder
  - Delta modulation
  - Time stretching
  - Waveform
  - Wavetable synthesis
  - Granular synthesis
  - Ambiguity function
  - Harmonic Vector Excitation Coding (HVXC)
  - Pitch detection algorithm
  - Pitch shifting
  - MBROLA
  - Icophone
  - CereProc
  - Sinsy

References

  1. Grażyna Demenko (1999). Analiza cech suprasegmentalnych języka polskiego na potrzeby technologii mowy (PDF) (Ph.D. thesis). Seria Językoznawstwo Stosowane. 17. Uniwersytet im. Adama Mickiewicza w Poznaniu. Fig. 7.1, p. 63.
  2. Charpentier, F.; Stella, M. (Apr 1986). "Diphone synthesis using an overlap-add technique for speech waveforms concatenation". ICASSP '86: IEEE International Conference on Acoustics, Speech, and Signal Processing. 11: 2015–2018. doi:10.1109/ICASSP.1986.1168657.