PSOLA

Oscillograms, spectrograms and intonograms of the Polish expressions (a) "jajem" [egg], (b) "ja jem" [I'm eating], (c) "nawóz" [fertiliser], (d) "na wóz" [on a cart] [1]

PSOLA (Pitch Synchronous Overlap and Add) is a digital signal processing technique used in speech processing and, more specifically, speech synthesis. It can be used to modify the pitch and duration of a speech signal. It was introduced around 1986.[2]

PSOLA works by dividing the speech waveform into small overlapping segments, typically centred on pitch marks (estimated glottal pulse instants) and spanning about two pitch periods, which is what makes the analysis "pitch synchronous". To change the pitch of the signal, the segments are moved further apart (to decrease the pitch) or closer together (to increase the pitch). To change the duration of the signal, segments are repeated several times (to increase the duration) or some are dropped (to decrease the duration). The segments are then recombined using the overlap-add technique.

PSOLA can be used to change the prosody of a speech signal.
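A minimal sketch of the time-domain variant (TD-PSOLA) for pitch shifting is given below in Python with NumPy. It is illustrative rather than the original authors' method: the pitch marks are assumed to come from a separate pitch detection step, and signal, pitch_marks and shift_factor are hypothetical inputs named for this example.

    import numpy as np

    def psola_pitch_shift(signal, pitch_marks, shift_factor):
        """Change the pitch by shift_factor (>1 raises it) while keeping the duration."""
        pitch_marks = np.asarray(pitch_marks, dtype=int)
        periods = np.diff(pitch_marks)                   # local pitch period at each mark
        out = np.zeros(len(signal))

        t = float(pitch_marks[0])                        # first synthesis pitch mark
        while t < pitch_marks[-1]:
            i = int(np.argmin(np.abs(pitch_marks - t)))  # nearest analysis pitch mark
            period = int(periods[min(i, len(periods) - 1)])
            if period <= 0:
                break

            # Extract a two-period, Hann-windowed segment centred on the analysis mark.
            start, end = pitch_marks[i] - period, pitch_marks[i] + period
            if start >= 0 and end <= len(signal):
                segment = signal[start:end] * np.hanning(end - start)

                # Overlap-add the segment at the synthesis mark.
                o_start = int(round(t)) - period
                if o_start >= 0 and o_start + len(segment) <= len(out):
                    out[o_start:o_start + len(segment)] += segment

            # Synthesis marks are spaced period/shift_factor apart: closer together
            # raises the pitch, further apart lowers it.
            t += period / shift_factor
        return out

Duration modification works analogously: segments are overlap-added more than once to lengthen the signal, or skipped to shorten it, while the synthesis mark spacing is kept equal to the analysis period so the pitch is unchanged.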

Related Research Articles

Additive synthesis is a sound synthesis technique that creates timbre by adding sine waves together.
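As a small illustration of the idea (not tied to any particular synthesizer; the frequencies and amplitudes are arbitrary example values), a tone can be built by summing a few harmonically related sine waves:

    import numpy as np

    fs = 44100                                  # sample rate in Hz (example value)
    t = np.arange(fs) / fs                      # one second of time instants
    # Sum three harmonics of 220 Hz with decreasing amplitudes to shape the timbre.
    tone = sum(a * np.sin(2 * np.pi * f * t)
               for a, f in [(1.0, 220.0), (0.5, 440.0), (0.25, 660.0)])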

Audio signal processing is a subfield of signal processing that is concerned with the electronic manipulation of audio signals. Audio signals are electronic representations of sound waves—longitudinal waves which travel through air, consisting of compressions and rarefactions. The energy contained in audio signals or sound power level is typically measured in decibels. As audio signals may be represented in either digital or analog format, processing may occur in either domain. Analog processors operate directly on the electrical signal, while digital processors operate mathematically on its digital representation.

A vocoder is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation.

Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive model.
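As a hedged sketch of the idea (the autocorrelation method, with the prediction order chosen arbitrarily; not a production encoder), the predictor coefficients of one windowed frame can be obtained by solving the Yule-Walker normal equations:

    import numpy as np

    def lpc_coefficients(frame, order=10):
        """Estimate linear-prediction coefficients of one windowed frame."""
        # Autocorrelation of the frame at lags 0..order.
        r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
        # Solve the Toeplitz normal equations R a = r for the predictor coefficients,
        # so that frame[n] is approximated by sum(a[k] * frame[n - 1 - k]).
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        return np.linalg.solve(R, r[1:])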

Delta modulation is an analog-to-digital and digital-to-analog signal conversion technique used for the transmission of voice information where quality is not of primary importance. DM is the simplest form of differential pulse-code modulation (DPCM), in which the difference between successive samples is encoded into n-bit data streams. In delta modulation, the transmitted data are reduced to a 1-bit data stream representing either up (↗) or down (↘).
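A minimal 1-bit encoder sketch (the fixed step size is an arbitrary example value, and real modulators typically run at a heavily oversampled rate):

    import numpy as np

    def delta_modulate(x, step=0.05):
        """Encode x as a 1-bit stream: 1 means step up, 0 means step down."""
        bits = np.zeros(len(x), dtype=np.uint8)
        approx = 0.0                            # the decoder's running approximation
        for n, sample in enumerate(x):
            bits[n] = 1 if sample > approx else 0
            approx += step if bits[n] else -step
        return bits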

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. Pitch scaling is the opposite: the process of changing the pitch without affecting the speed. Pitch shift is pitch scaling implemented in an effects unit and intended for live performance. Pitch control is a simpler process which affects pitch and speed simultaneously by slowing down or speeding up a recording.

The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The reference point between this scale and normal frequency measurement is defined by assigning a perceptual pitch of 1000 mels to a 1000 Hz tone, 40 dB above the listener's threshold. Above about 500 Hz, increasingly large intervals are judged by listeners to produce equal pitch increments.
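One commonly used conversion formula (several variants exist, so this is only one possible choice) reproduces the 1000 Hz / 1000 mel reference point:

    import numpy as np

    def hz_to_mel(f):
        """Convert frequency in Hz to mels; hz_to_mel(1000) is approximately 1000."""
        return 2595.0 * np.log10(1.0 + f / 700.0)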

In signal processing and related disciplines, aliasing is the overlapping of frequency components resulting from a sample rate below the Nyquist rate. This overlap results in distortion or artifacts when the signal is reconstructed from its samples, causing the reconstructed signal to differ from the original continuous signal. Aliasing that occurs in signals sampled in time, for instance in digital audio or the stroboscopic effect, is referred to as temporal aliasing. Aliasing in spatially sampled signals is referred to as spatial aliasing.
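A small numerical illustration with arbitrarily chosen frequencies: at a 1000 Hz sample rate (Nyquist frequency 500 Hz), a 900 Hz cosine produces exactly the same samples as a 100 Hz cosine, so the two cannot be distinguished after sampling.

    import numpy as np

    fs = 1000                                   # sample rate; Nyquist frequency is 500 Hz
    n = np.arange(32)                           # sample indices
    x_high = np.cos(2 * np.pi * 900 * n / fs)   # 900 Hz tone, above the Nyquist frequency
    x_alias = np.cos(2 * np.pi * 100 * n / fs)  # 900 Hz folds back to 1000 - 900 = 100 Hz
    print(np.allclose(x_high, x_alias))         # True: the sampled tones are identical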

Granular synthesis is a sound synthesis method that operates on the microsound time scale.

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams. When the data are represented in a 3D plot they may be called waterfall displays.

A pitch detection algorithm (PDA) is an algorithm designed to estimate the pitch or fundamental frequency of a quasiperiodic or oscillating signal, usually a digital recording of speech or a musical note or tone. This can be done in the time domain, the frequency domain, or both.
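A minimal time-domain sketch using autocorrelation, one of several standard approaches (the search bounds fmin and fmax are arbitrary example values, and the frame is assumed to be longer than the longest candidate period):

    import numpy as np

    def estimate_f0(frame, fs, fmin=50.0, fmax=500.0):
        """Estimate the fundamental frequency of one frame by autocorrelation."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
        lo, hi = int(fs / fmax), int(fs / fmin)     # lag range for the allowed pitch range
        lag = lo + int(np.argmax(ac[lo:hi]))        # lag with the strongest periodicity
        return fs / lag                             # fundamental frequency in Hz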

Pitch shifting is a sound recording technique in which the original pitch of a sound is raised or lowered. Effects units that raise or lower pitch by a pre-designated musical interval (transposition) are known as pitch shifters.

Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics, and to cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.

The time-stretch analog-to-digital converter (TS-ADC), also known as the time-stretch enhanced recorder (TiSER), is an analog-to-digital converter (ADC) system that has the capability of digitizing very high bandwidth signals that cannot be captured by conventional electronic ADCs. Alternatively, it is also known as the photonic time-stretch (PTS) digitizer, since it uses an optical frontend. It relies on the process of time-stretch, which effectively slows down the analog signal in time before it can be digitized by a standard electronic ADC.

Psychoacoustics is the branch of psychophysics involving the scientific study of the perception of sound by the human auditory system. It is the branch of science studying the psychological responses associated with sound including noise, speech, and music. Psychoacoustics is an interdisciplinary field including psychology, acoustics, electronic engineering, physics, biology, physiology, and computer science.

Audio forensics is the field of forensic science relating to the acquisition, analysis, and evaluation of sound recordings that may ultimately be presented as admissible evidence in a court of law or some other official venue.

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperformed Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.

Matti Antero Karjalainen was a Finnish speech processing researcher and inventor in the fields of speech synthesis, speech analysis, speech technology, audio signal processing and psychoacoustics. He was the head of Acoustics Laboratory at the Helsinki University of Technology from 1980 to 2006.

Éric Moulines is a French researcher in statistical learning and signal processing. He received the silver medal from the CNRS in 2010 and the France Télécom prize, awarded in collaboration with the French Academy of Sciences, in 2011. He was appointed a Fellow of the European Association for Signal Processing in 2012 and of the Institute of Mathematical Statistics in 2016. He is General Engineer of the Corps des Mines (X81).

References

  1. Grażyna Demenko (1999). Analiza cech suprasegmentalnych języka polskiego na potrzeby technologii mowy [Analysis of the suprasegmental features of Polish for the needs of speech technology] (PDF) (Ph.D. thesis). Seria Językoznawstwo Stosowane. Vol. 17. Uniwersytet im. Adama Mickiewicza w Poznaniu. Fig. 7.1, p. 63.
  2. Charpentier, F.; Stella, M. (1986). "Diphone synthesis using an overlap-add technique for speech waveforms concatenation". ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 11. pp. 2015–2018. doi:10.1109/ICASSP.1986.1168657. S2CID 62440369.