In signal processing, Feature space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that is typically applied in a speaker-adaptive way: fMLLR transforms acoustic features into speaker-adapted features through multiplication by a transformation matrix. In some literature, fMLLR is also known as Constrained Maximum Likelihood Linear Regression (cMLLR).
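Concretely, for an acoustic feature vector $x(t)$ at frame $t$, the speaker-adapted feature is obtained by a per-speaker affine transform (a standard way of writing it, matching the Kaldi formulation below):

$$\hat{x}(t) = A^{(s)}\,x(t) + b^{(s)} = W^{(s)}\,x(t)^{+}, \qquad x(t)^{+} = \begin{bmatrix} x(t) \\ 1 \end{bmatrix}$$

where $W^{(s)} = \big[\,A^{(s)} \;\; b^{(s)}\,\big]$ collects the transformation matrix and bias for speaker $s$.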
fMLLR transformations are trained in a maximum likelihood sense on adaptation data. Although such transformations could in principle be estimated in many ways, fMLLR uses maximum likelihood (ML) estimation: the transformation is trained on a particular set of adaptation data so as to maximize the likelihood of that data given the current model set.
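Schematically, with adaptation data $x(1),\dots,x(T)$ from one speaker and a current model set $\mathcal{M}$, the estimated transform is (a standard formulation rather than a quote from the references; the $\log|\det A|$ term is the Jacobian accounting for the change of variables in feature space):

$$W^{*} = \arg\max_{W} \sum_{t=1}^{T} \Big[ \log p\big(W\,x(t)^{+} \,\big|\, \mathcal{M}\big) + \log\big|\det A\big| \Big]$$

where $x(t)^{+}$ is the feature vector with a 1 appended, as above.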
This technique is a widely used approach for speaker adaptation in HMM-based speech recognition. [1][2] Later research [3] also showed that fMLLR is an excellent acoustic feature for DNN/HMM [4] hybrid speech recognition models.
The advantages of fMLLR include the following:
Major problems and disadvantages of fMLLR:
The fMLLR feature transform can be easily computed with the open-source speech toolkit Kaldi; the Kaldi script uses the standard estimation scheme described in Appendix B of the original paper [1], in particular the section B.1, "Direct method over rows".
In the Kaldi formulation, fMLLR is an affine feature transform of the form $x \mapsto A x + b$, which can be written in the form $x \mapsto W x^{+}$, where $x^{+} = \begin{bmatrix} x \\ 1 \end{bmatrix}$ is the acoustic feature with a 1 appended. Note that this differs from some of the literature, where the 1 comes first, as $x^{+} = \begin{bmatrix} 1 \\ x \end{bmatrix}$.
The sufficient statistics stored are:

$$K = \sum_{t,j,m} \gamma_{j,m}(t)\, \Sigma_{jm}^{-1} \mu_{jm}\, x(t)^{+T}$$

where $\Sigma_{jm}^{-1}$ is the inverse covariance matrix and $\gamma_{j,m}(t)$ is the occupation probability of Gaussian $m$ of state $j$ at time $t$, and, for $0 \le i < D$ where $D$ is the feature dimension:

$$G_i = \sum_{t,j,m} \gamma_{j,m}(t)\, \frac{1}{\sigma^{2}_{jm,i}}\, x(t)^{+}\, x(t)^{+T}$$
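In terms of these statistics, the per-speaker objective being maximized can be written in a commonly used form (a sketch consistent with the definitions above, where $\beta = \sum_{t,j,m} \gamma_{j,m}(t)$ is the total occupation count and $w_i$ is the $i$-th row of $W$):

$$\mathcal{F}(W) = \beta \log\big|\det A\big| + \operatorname{tr}\big(W^{T} K\big) - \tfrac{1}{2} \sum_{i=0}^{D-1} w_i\, G_i\, w_i^{T}$$

The "direct method over rows" mentioned above then updates one row $w_i$ of $W$ at a time so as to increase $\mathcal{F}$.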
For a thorough review that explains fMLLR and the commonly used estimation techniques, see the original paper, "Maximum likelihood linear transformations for HMM-based speech recognition" [1].
Note that the Kaldi script that performs the feature transforms of fMLLR differs from [1] by using a column of the inverse in place of the cofactor row. In other words, the factor of the determinant is ignored: since $\operatorname{cof}(A) = \det(A)\,(A^{-1})^{T}$, each cofactor row of $A$ is just the corresponding column of $A^{-1}$ scaled by $\det(A)$, and this scaling does not affect the resulting transform while avoiding potential numerical underflow or overflow.
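For illustration, once per-speaker transforms have been estimated (the recipe below stores them as exp/tri4b/trans.* archives), they can be applied to features with Kaldi's transform-feats tool; the exact paths here are illustrative, not part of the recipe:

```bash
# Apply estimated per-speaker fMLLR transforms to raw features.
# --utt2spk maps each utterance to its speaker so that the matching
# per-speaker transform matrix is selected from the archive.
transform-feats --utt2spk=ark:data/dev_clean/utt2spk \
    ark:exp/tri4b/trans.1 \
    scp:data/dev_clean/feats.scp \
    ark:fmllr_feats.ark
```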
Experimental results show that using fMLLR features in speech recognition yields consistent improvements over other acoustic features on various commonly used benchmark datasets (TIMIT, LibriSpeech, etc.).
In particular, fMLLR features outperform MFCC and FBANK coefficients, which is mainly due to the speaker adaptation process that fMLLR performs. [3]
In [3], the phoneme error rate (PER, %) is reported for the test set of TIMIT with various neural architectures:
Models/Features | MFCC | FBANK | fMLLR
---|---|---|---
MLP | 18.2 | 18.7 | 16.7
RNN | 17.7 | 17.2 | 15.9
LSTM | 15.1 | 14.3 | 14.5
GRU | 16.0 | 15.2 | 14.9
Li-GRU | 15.3 | 14.9 | 14.2
As expected, fMLLR features outperform MFCC and FBANK coefficients across nearly all of the model architectures.
Here, the MLP (multi-layer perceptron) serves as a simple baseline, while the RNN, LSTM, and GRU are well-known recurrent models.
The Li-GRU [5] architecture is based on a single gate and thus saves 33% of the computation of a standard GRU model, while also effectively addressing the vanishing gradient problem of recurrent models.
As a result, the best performance is obtained with the Li-GRU model on fMLLR features.
fMLLR features can be extracted as reported in the s5 recipe of Kaldi.
The Kaldi scripts can extract fMLLR features on different datasets; below are the basic steps to extract fMLLR features from the open-source speech corpus LibriSpeech.
Note that the instructions below are for the subsets train-clean-100, train-clean-360, dev-clean, and test-clean, but they can be easily extended to support the other sets dev-other, test-other, and train-other-500.
1. Replace the contents of $KALDI_ROOT/egs/librispeech/s5/ with the files in the repository.
2. Change $KALDI_ROOT/egs/librispeech/s5/cmd.sh to replace queue.pl with run.pl:
```bash
export train_cmd="run.pl --mem 2G"
export decode_cmd="run.pl --mem 4G"
export mkgraph_cmd="run.pl --mem 8G"
```
3. Change the data path in run.sh to your LibriSpeech data path; the directory LibriSpeech/ should be under that path. For example:
```bash
data=/media/user/SSD  # example path
```
4. Install flac with: sudo apt-get install flac
5. Run run.sh for LibriSpeech at least until Stage 13 (included); for simplicity, you can use the modified run.sh.
6. Copy the exp/tri4b/trans.* files into exp/tri4b/decode_tgsmall_train_clean_*/ with the following command (shown for train_clean_100; repeat analogously for train_clean_360):
```bash
mkdir exp/tri4b/decode_tgsmall_train_clean_100 && cp exp/tri4b/trans.* exp/tri4b/decode_tgsmall_train_clean_100/
```
7. Compute the fMLLR features by running the following script:
```bash
#!/bin/bash
. ./cmd.sh  ## You'll want to change cmd.sh to something that will work on your system.
. ./path.sh ## Source the tools/utils (import the queue.pl)
gmmdir=exp/tri4b

for chunk in dev_clean test_clean train_clean_100 train_clean_360; do
  dir=fmllr/$chunk
  steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
      --transform-dir $gmmdir/decode_tgsmall_$chunk \
      $dir data/$chunk $gmmdir $dir/log $dir/data || exit 1
  compute-cmvn-stats --spk2utt=ark:data/$chunk/spk2utt \
      scp:fmllr/$chunk/feats.scp ark:$dir/data/cmvn_speaker.ark
done
```
8. Compute the alignments with steps/align_fmllr.sh:
```bash
# alignments on dev_clean and test_clean
steps/align_fmllr.sh --nj 10 data/dev_clean data/lang exp/tri4b exp/tri4b_ali_dev_clean
steps/align_fmllr.sh --nj 10 data/test_clean data/lang exp/tri4b exp/tri4b_ali_test_clean
steps/align_fmllr.sh --nj 30 data/train_clean_100 data/lang exp/tri4b exp/tri4b_ali_clean_100
steps/align_fmllr.sh --nj 30 data/train_clean_360 data/lang exp/tri4b exp/tri4b_ali_clean_360
```
9. Apply per-speaker CMVN and dump the resulting fMLLR features to new .ark files:
```bash
#!/bin/bash
data=/user/kaldi/egs/librispeech/s5  ## You'll want to change this path to something that will work on your system.
rm -rf $data/fmllr_cmvn/
mkdir $data/fmllr_cmvn/

for part in dev_clean test_clean train_clean_100 train_clean_360; do
  mkdir $data/fmllr_cmvn/$part/
  # apply-cmvn normalizes per speaker; add-deltas with --delta-order=0
  # appends no deltas and simply writes the features to a single archive.
  apply-cmvn --utt2spk=ark:$data/fmllr/$part/utt2spk \
      ark:$data/fmllr/$part/data/cmvn_speaker.ark \
      scp:$data/fmllr/$part/feats.scp ark:- | \
    add-deltas --delta-order=0 ark:- ark:$data/fmllr_cmvn/$part/fmllr_cmvn.ark
done

du -sh $data/fmllr_cmvn/*
echo "Done!"
```
10. Convert the Kaldi .ark features as needed for your downstream setup, e.g. with: python ark2libri.py
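As a quick sanity check, a few frames of the dumped features can be printed in text form with Kaldi's copy-feats; the path below follows the layout from step 9:

```bash
# Print the beginning of the dumped fMLLR features in human-readable text form.
copy-feats ark:fmllr_cmvn/dev_clean/fmllr_cmvn.ark ark,t:- | head -n 5
```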