Speech processing

Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement and speaker recognition. [1]

History

Early attempts at speech processing and recognition focused primarily on understanding a handful of simple phonetic elements such as vowels. Pioneering work on speech recognition based on spectral analysis of speech was reported in the 1940s. [3] In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker. [2]

Linear predictive coding (LPC), a speech processing algorithm, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966. [4] Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s. [4] LPC was the basis for voice-over-IP (VoIP) technology, [4] as well as speech synthesizer chips, such as the Texas Instruments LPC Speech Chips used in the Speak & Spell toys from 1978. [5]

One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary. [6]

By the early 2000s, the dominant speech processing strategy started to shift away from hidden Markov models towards more modern neural networks and deep learning.[citation needed]

Techniques

Dynamic time warping

Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences that may vary in speed. In general, DTW calculates an optimal match between two given sequences (e.g. time series) subject to certain restrictions and rules. The optimal match is the one that satisfies all the restrictions and rules and has the minimal cost, where the cost is computed as the sum, over all matched pairs of indices, of the absolute differences between their values.[citation needed]
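As an illustration, the minimal cost described above can be computed with dynamic programming. The following Python sketch assumes two restrictions that are commonly used with DTW but are not spelled out in the text: the first and last elements of the two sequences are matched to each other, and the matched indices advance monotonically in steps of at most one. The function name dtw_distance is hypothetical.

```python
def dtw_distance(a, b):
    """Return the minimal DTW cost between two numeric sequences.

    Assumed restrictions (not taken from the article): endpoints are
    matched to each other and indices advance monotonically by at most 1.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Cost of matching this pair: absolute difference of the values.
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# The second sequence is a slowed-down copy of the first, so the cost is zero.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # prints 0.0
```

The table has (n + 1) × (m + 1) entries, so this sketch runs in time proportional to the product of the two sequence lengths.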

Hidden Markov models

A hidden Markov model (HMM) can be represented as the simplest dynamic Bayesian network. The goal is to estimate a hidden variable x(t) given a sequence of observations y(t). By the Markov property, the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all earlier times, depends only on the value of the hidden variable x(t − 1). Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t) (both at time t).[citation needed]
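The two dependence assumptions above are what make recursive estimation of the hidden variable tractable. The following Python sketch shows one standard way to do this, the forward (filtering) recursion, for a discrete HMM. The function name forward_filter and the toy probabilities are hypothetical and chosen only for illustration; the sketch also assumes that every observation has non-zero probability under at least one state.

```python
def forward_filter(initial, transition, emission, observations):
    """Return P(x(t) | y(1), ..., y(t)) for each time step t.

    initial[i]       : P(x(1) = i)
    transition[i][j] : P(x(t) = j | x(t-1) = i)
    emission[j][k]   : P(y(t) = k | x(t) = j)
    """
    n_states = len(initial)
    belief = list(initial)
    posteriors = []
    for t, y in enumerate(observations):
        if t > 0:
            # Prediction: x(t) depends only on x(t-1) (Markov property).
            belief = [sum(belief[i] * transition[i][j] for i in range(n_states))
                      for j in range(n_states)]
        # Update: y(t) depends only on x(t).
        weighted = [belief[j] * emission[j][y] for j in range(n_states)]
        total = sum(weighted)  # assumed to be non-zero
        belief = [w / total for w in weighted]
        posteriors.append(belief)
    return posteriors


# Hypothetical two-state model with two possible observation symbols.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]
for t, p in enumerate(forward_filter(initial, transition, emission, [0, 0, 1]), 1):
    print(t, [round(v, 3) for v in p])
```

Each step uses only the previous belief and the current observation, which mirrors the statement that x(t) depends only on x(t − 1) and y(t) only on x(t).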

Artificial neural networks

An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.[citation needed]
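The computation performed by a single artificial neuron can be sketched in a few lines of Python. The connection weights, the bias term and the choice of the logistic sigmoid as the non-linear function are assumptions made for illustration; the text above only requires some non-linear function of the summed inputs. The names sigmoid, neuron and layer are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, one possible choice of non-linearity."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Output of one artificial neuron: non-linear function of a weighted sum."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def layer(inputs, weight_rows, biases):
    """Outputs of several neurons that all receive the same input signals."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Two input signals feed three neurons; their outputs feed one output neuron.
hidden = layer([0.5, -1.0],
               [[1.0, 2.0], [-1.0, 0.5], [0.0, 0.0]],
               [0.0, 0.1, 0.0])
output = neuron(hidden, [0.3, -0.2, 0.5], 0.0)
print(hidden, output)
```

Every signal passed between neurons here is a real number, and chaining layers in this way is what connects the units into a network.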

Phase-aware processing

Phase is usually assumed to be a uniformly distributed random variable and thus uninformative. This assumption stems from phase wrapping: [7] the result of the arctangent function is not continuous because of periodic jumps of 2π. After phase unwrapping (see [8], Chapter 2.3, Instantaneous phase and frequency), the phase can be expressed as [7] [9] φ(h, l) = φ_lin(h, l) + Ψ(h, l), where φ_lin(h, l) = ω_0(l′) Δt is the linear phase (Δt is the temporal shift at each frame of analysis) and Ψ(h, l) is the phase contribution of the vocal tract and the phase source. [9] The resulting phase estimates can be used for noise reduction: temporal smoothing of the instantaneous phase [10] and of its derivatives with respect to time (instantaneous frequency) and frequency (group delay), [11] and smoothing of the phase across frequency. [11] Joint amplitude and phase estimators can recover speech more accurately under the assumption that the phase follows a von Mises distribution. [9]
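The wrapping effect and its removal can be illustrated with a short Python sketch. It is a generic one-dimensional unwrapping procedure, not the specific method of the cited works, and it assumes that the true phase changes by less than π between consecutive samples. The function names wrap and unwrap are hypothetical.

```python
import math


def wrap(phase):
    """Map a phase value to the principal interval returned by the arctangent."""
    return math.atan2(math.sin(phase), math.cos(phase))


def unwrap(wrapped):
    """Remove the 2*pi jumps from a sequence of wrapped phase values.

    Assumes the true phase changes by less than pi between samples.
    """
    if not wrapped:
        return []
    out = [wrapped[0]]
    for p in wrapped[1:]:
        delta = p - out[-1]
        # Subtract the multiple of 2*pi that brings the step closest to zero.
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        out.append(out[-1] + delta)
    return out


# A linear phase ramp: continuous before wrapping, discontinuous after.
true_phase = [0.5 * k for k in range(20)]
wrapped = [wrap(p) for p in true_phase]
recovered = unwrap(wrapped)
print([round(w, 2) for w in wrapped[:9]])
print([round(r, 2) for r in recovered[:9]])
```

The wrapped sequence jumps by almost 2π each time the ramp crosses an odd multiple of π, whereas the unwrapped sequence recovers the linear trend.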

References

  1. Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiking; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). "The Speed Submission to DIHARD II: Contributions & Lessons Learned". arXiv:1911.02388 [eess.AS].
  2. Juang, B.-H.; Rabiner, L. R. (2006). "Speech Recognition, Automatic: History". Encyclopedia of Language & Linguistics. Elsevier. pp. 806–819. doi:10.1016/b0-08-044854-2/00906-8. ISBN 9780080448541.
  3. Myasnikov, L. L.; Myasnikova, Ye. N. (1970). Automatic recognition of sound pattern (in Russian). Leningrad: Energiya.
  4. Gray, Robert M. (2010). "A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol" (PDF). Found. Trends Signal Process. 3 (4): 203–303. doi:10.1561/2000000036. ISSN 1932-8346.
  5. "VC&G - VC&G Interview: 30 Years Later, Richard Wiggins Talks Speak & Spell Development".
  6. Huang, Xuedong; Baker, James; Reddy, Raj (2014-01-01). "A historical perspective of speech recognition". Communications of the ACM. 57 (1): 94–103. doi:10.1145/2500887. ISSN 0001-0782. S2CID 6175701.
  7. Mowlaee, Pejman; Kulmer, Josef (August 2015). "Phase Estimation in Single-Channel Speech Enhancement: Limits-Potential". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 23 (8): 1283–1294. doi:10.1109/TASLP.2015.2430820. ISSN 2329-9290. S2CID 13058142. Retrieved 2017-12-03.
  8. Mowlaee, Pejman; Kulmer, Josef; Stahl, Johannes; Mayer, Florian (2017). Single channel phase-aware signal processing in speech communication: theory and practice. Chichester: Wiley. ISBN 978-1-119-23882-9.
  9. Kulmer, Josef; Mowlaee, Pejman (April 2015). "Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR". Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE. pp. 5063–5067.
  10. Kulmer, Josef; Mowlaee, Pejman (May 2015). "Phase Estimation in Single Channel Speech Enhancement Using Phase Decomposition". IEEE Signal Processing Letters. 22 (5): 598–602. doi:10.1109/LSP.2014.2365040. ISSN 1070-9908. S2CID 15503015. Retrieved 2017-12-03.
  11. Mowlaee, Pejman; Saeidi, Rahim; Stylianou, Yannis (July 2016). "Advances in phase-aware signal processing in speech communication". Speech Communication. 81: 1–29. doi:10.1016/j.specom.2016.04.002. ISSN 0167-6393. S2CID 17409161. Retrieved 2017-12-03.