Noisy channel model

Last updated November 05, 2024

The noisy channel model is a framework used in spell checkers, question answering, speech recognition, and machine translation. In this model, the goal is to find the intended word given a word where the letters have been scrambled in some manner.

In spell-checking

See Chapter B of.^[1]

Given an alphabet $\Sigma$ , let $\Sigma ^{*}$ be the set of all finite strings over $\Sigma$ . Let the dictionary $D$ of valid words be some subset of $\Sigma ^{*}$ , i.e., $D\subseteq \Sigma ^{*}$ .

The noisy channel is the matrix

\Gamma _{ws}=\Pr(s|w)

,

where $w\in D$ is the intended word and $s\in \Sigma ^{*}$ is the scrambled word that was actually received.

The goal of the noisy channel model is to find the intended word given the scrambled word that was received. The decision function ${\displaystyle \sigma$ is a function that, given a scrambled word, returns the intended word.

Methods of constructing a decision function include the maximum likelihood rule, the maximum a posteriori rule, and the minimum distance rule.

In some cases, it may be better to accept the scrambled word as the intended word rather than attempt to find an intended word in the dictionary. For example, the word schönfinkeling may not be in the dictionary, but might in fact be the intended word.

Example

Consider the English alphabet $\Sigma =\{a,b,c,...,y,z,A,B,...,Z,...\}$ . Some subset $D\subseteq \Sigma ^{*}$ makes up the dictionary of valid English words.

There are several mistakes that may occur while typing, including:

Missing letters, e.g., leter instead of letter
Accidental letter additions, e.g., misstake instead of mistake
Swapping letters, e.g., recieved instead of received
Replacing letters, e.g., fimite instead of finite

To construct the noisy channel matrix $\Gamma$ , we must consider the probability of each mistake, given the intended word ( $\Pr(s|w)$ for all $w\in D$ and $s\in \Sigma ^{*}$ ). These probabilities may be gathered, for example, by considering the Damerau–Levenshtein distance between $s$ and $w$ or by comparing the draft of an essay with one that has been manually edited for spelling.

In machine translation

One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.
— Warren Weaver, Letter to Norbert Wiener, March 4, 1947

See chapter 1, and chapter 25 of.^[2]

Suppose we want to translate a foreign language to English, we could model $P(E|F)$ directly: the probability that we have English sentence E given foreign sentence F, then we pick the most likely one ${\hat {E}}=\arg \max _{E}P(E|F)$ . However, by Bayes law, we have the equivalent equation: ${\hat {E}}={\underset {E\in {\text{ English }}}{\operatorname {argmax} }}\overbrace {P(F\mid E)} ^{\text{translation model }}\overbrace {P(E)} ^{\text{language model}}$ The benefit of the noisy-channel model is in terms of data: If collecting a parallel corpus is costly, then we would have only a small parallel corpus, so we can only train a moderately good English-to-foreign translation model, and a moderately good foreign-to-English translation model. However, we can collect a large corpus in the foreign language only, and a large corpus in the English language only, to train two good language models. Combining these four models, we immediately get a good English-to-foreign translator and a good foreign-to-English translator.^[3]

The cost of noisy-channel model is that using Bayesian inference is more costly than using a translation model directly. Instead of reading out the most likely translation by $\arg \max _{E}P(E|F)$ , it would have to read out predictions by both the translation model and the language model, multiply them, and search for the highest number.

In speech recognition

Speech recognition can be thought of as translating from a sound-language to a text-language. Consequently, we have ${\hat {T}}={\underset {T\in {\text{ Text }}}{\operatorname {argmax} }}\overbrace {P(S\mid T)} ^{\text{speech model }}\overbrace {P(T)} ^{\text{language model}}$ where $P(S|T)$ is the probability that a speech sound S is produced if the speaker is intending to say text T. Intuitively, this equation states that the most likely text is a text that's both a likely text in the language, and produces the speech sound with high probability.

The utility of the noisy-channel model is not in capacity. Theoretically, any noisy-channel model can be replicated by a direct $P(T|S)$ model. However, the noisy-channel model factors the model into two parts which are appropriate for the situation, and consequently it is generally more well-behaved.

When a human speaks, it does not produce the sound directly, but first produces the text it wants to speak in the language centers of the brain, then the text is translated into sound by the motor cortex, vocal cords, and other parts of the body. The noisy-channel model matches this model of the human, and so it is appropriate. This is justified in the practical success of noisy-channel model in speech recognition.

Example

Consider the sound-language sentence (written in IPA for English) S = aɪ wʊd laɪk wʌn tuː. There are three possible texts $T_{1},T_{2},T_{3}$ :

$T_{1}=$ I would like one to.
$T_{2}=$ I would like one too.
$T_{3}=$ I would like one two.

that are equally likely, in the sense that $P(S|T_{1})=P(S|T_{2})=P(S|T_{3})$ . With a good English language model, we would have $P(T_{2})>P(T_{1})>P(T_{3})$ , since the second sentence is grammatical, the first is not quite, but close to a grammatical one (such as "I would like one to [go]."), while the third one is far from grammatical.

Consequently, the noisy-channel model would output $T_{2}$ as the best transcription.

Related Research Articles

In physics, the cross section is a measure of the probability that a specific process will take place in a collision of two particles. For example, the Rutherford cross-section is a measure of probability that an alpha particle will be deflected by a given angle during an interaction with an atomic nucleus. Cross section is typically denoted $σ$ (sigma) and is expressed in units of area, more specifically in barns. In a way, it can be thought of as the size of the object that the excitation must hit in order for the process to occur, but more exactly, it is a parameter of a stochastic process.

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" which assumes that the features are conditionally independent, given the target class. The strength (naivety) of this assumption is what gives the classifier its name. These classifiers are among the simplest Bayesian network models.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In theoretical linguistics and computational linguistics, probabilistic context free grammars (PCFGs) extend context-free grammars, similar to how hidden Markov models extend regular grammars. Each production is assigned a probability. The probability of a derivation (parse) is the product of the probabilities of the productions used in that derivation. These probabilities can be viewed as parameters of the model, and for large problems it is convenient to learn these parameters via machine learning. A probabilistic grammar's validity is constrained by context of its training dataset.

<span class="mw-page-title-main">Girsanov theorem</span> Theorem on changes in stochastic processes

In probability theory, Girsanov's theorem or the Cameron-Martin-Girsanov theorem tells how stochastic processes change under changes in measure. The theorem is especially important in the theory of financial mathematics as it tells how to convert from the physical measure, which describes the probability that an underlying instrument will take a particular value or values, to the risk-neutral measure which is a very useful tool for evaluating the value of derivatives on the underlying.

In control systems, sliding mode control (SMC) is a nonlinear control method that alters the dynamics of a nonlinear system by applying a discontinuous control signal that forces the system to "slide" along a cross-section of the system's normal behavior. The state-feedback control law is not a continuous function of time. Instead, it can switch from one continuous structure to another based on the current position in the state space. Hence, sliding mode control is a variable structure control method. The multiple control structures are designed so that trajectories always move toward an adjacent region with a different control structure, and so the ultimate trajectory will not exist entirely within one control structure. Instead, it will slide along the boundaries of the control structures. The motion of the system as it slides along these boundaries is called a sliding mode and the geometrical locus consisting of the boundaries is called the sliding (hyper)surface. In the context of modern control theory, any variable structure system, like a system under SMC, may be viewed as a special case of a hybrid dynamical system as the system both flows through a continuous state space but also moves through different discrete control modes.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

In particle physics, Fermi's interaction is an explanation of the beta decay, proposed by Enrico Fermi in 1933. The theory posits four fermions directly interacting with one another. This interaction explains beta decay of a neutron by direct coupling of a neutron with an electron, a neutrino and a proton.

A finite-state transducer (FST) is a finite-state machine with two memory tapes, following the terminology for Turing machines: an input tape and an output tape. This contrasts with an ordinary finite-state automaton, which has a single tape. An FST is a type of finite-state automaton (FSA) that maps between two sets of symbols. An FST is more general than an FSA. An FSA defines a formal language by defining a set of accepted strings, while an FST defines a relation between sets of strings.

A variance swap is an over-the-counter financial derivative that allows one to speculate on or hedge risks associated with the magnitude of movement, i.e. volatility, of some underlying product, like an exchange rate, interest rate, or stock index.

<span class="mw-page-title-main">Ornstein–Uhlenbeck process</span> Stochastic process modeling random walk with friction

In mathematics, the Ornstein–Uhlenbeck process is a stochastic process with applications in financial mathematics and the physical sciences. Its original application in physics was as a model for the velocity of a massive Brownian particle under the influence of friction. It is named after Leonard Ornstein and George Eugene Uhlenbeck.

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model. It is used when there is a non-zero amount of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences, as compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935.

In mathematical finance, the SABR model is a stochastic volatility model, which attempts to capture the volatility smile in derivatives markets. The name stands for "stochastic alpha, beta, rho", referring to the parameters of the model. The SABR model is widely used by practitioners in the financial industry, especially in the interest rate derivative markets. It was developed by Patrick S. Hagan, Deep Kumar, Andrew Lesniewski, and Diana Woodward.

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation of the current parental individuals, usually in a stochastic way. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value $. Like this, individuals with better and better -values are generated over the generation sequence.$

In finance, a volatility swap is a forward contract on the future realised volatility of a given underlying asset. Volatility swaps allow investors to trade the volatility of an asset directly, much as they would trade a price index. Its payoff at expiration is equal to

Sliding window based part-of-speech tagging is used to part-of-speech tag a text.

In filtering theory the Kushner equation is an equation for the conditional probability density of the state of a stochastic non-linear dynamical system, given noisy measurements of the state. It therefore provides the solution of the nonlinear filtering problem in estimation theory. The equation is sometimes referred to as the Stratonovich–Kushnerequation. However, the correct equation in terms of Itō calculus was first derived by Kushner although a more heuristic Stratonovich version of it appeared already in Stratonovich's works in late fifties. However, the derivation in terms of Itō calculus is due to Richard Bucy.

Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory.

A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly as the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality.

References

↑ Jurafsky, Dan (2009). Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. James H. Martin (2nd ed.). Upper Saddle River, N.J. ISBN 978-0-13-187321-6. OCLC 213375806.{{cite book}}: CS1 maint: location missing publisher (link)
↑ Brown, Peter F.; Della Pietra, Stephen A.; Della Pietra, Vincent J.; Mercer, Robert L. (1993). Hirschberg, Julia (ed.). "The Mathematics of Statistical Machine Translation: Parameter Estimation". Computational Linguistics. 19 (2): 263–311.

Brill, Eric; Moore, Robert C. (Jan 2000). "An Improved Error Model for Noisy Channel Spelling Correction". Proceedings of ACL 2000: 286–293. doi: 10.3115/1075218.1075255 .

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2023. All rights reserved. Draft of January 7, 2023. https://web.stanford.edu/~jurafsky/slp3/B.pdf

[2] Jurafsky, Dan (2009). Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. James H. Martin (2nd ed.). Upper Saddle River, N.J. ISBN 978-0-13-187321-6. OCLC 213375806.{{cite book}}: CS1 maint: location missing publisher (link)

[3] Brown, Peter F.; Della Pietra, Stephen A.; Della Pietra, Vincent J.; Mercer, Robert L. (1993). Hirschberg, Julia (ed.). "The Mathematics of Statistical Machine Translation: Parameter Estimation". Computational Linguistics. 19 (2): 263–311.

[1]

[2]

[3]

Noisy channel model

Contents

In spell-checking

Example

In machine translation

In speech recognition

Example

See also

Related Research Articles

References