Noisy channel model

Last updated

The noisy channel model is a framework used in spell checkers, question answering, speech recognition, and machine translation. In this model, the goal is to find the intended word given a word where the letters have been scrambled in some manner.

Contents

In spell-checking

See Chapter B of. [1]

Given an alphabet , let be the set of all finite strings over . Let the dictionary of valid words be some subset of , i.e., .

The noisy channel is the matrix

,

where is the intended word and is the scrambled word that was actually received.

The goal of the noisy channel model is to find the intended word given the scrambled word that was received. The decision function is a function that, given a scrambled word, returns the intended word.

Methods of constructing a decision function include the maximum likelihood rule, the maximum a posteriori rule, and the minimum distance rule.

In some cases, it may be better to accept the scrambled word as the intended word rather than attempt to find an intended word in the dictionary. For example, the word schönfinkeling may not be in the dictionary, but might in fact be the intended word.

Example

Consider the English alphabet . Some subset makes up the dictionary of valid English words.

There are several mistakes that may occur while typing, including:

  1. Missing letters, e.g., leter instead of letter
  2. Accidental letter additions, e.g., misstake instead of mistake
  3. Swapping letters, e.g., recieved instead of received
  4. Replacing letters, e.g., fimite instead of finite

To construct the noisy channel matrix , we must consider the probability of each mistake, given the intended word ( for all and ). These probabilities may be gathered, for example, by considering the Damerau–Levenshtein distance between and or by comparing the draft of an essay with one that has been manually edited for spelling.

In machine translation

One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.

Warren Weaver, Letter to Norbert Wiener, March 4, 1947

See chapter 1, and chapter 25 of. [2]

Suppose we want to translate a foreign language to English, we could model directly: the probability that we have English sentence E given foreign sentence F, then we pick the most likely one . However, by Bayes law, we have the equivalent equation:

The benefit of the noisy-channel model is in terms of data: If collecting a parallel corpus is costly, then we would have only a small parallel corpus, so we can only train a moderately good English-to-foreign translation model, and a moderately good foreign-to-English translation model. However, we can collect a large corpus in the foreign language only, and a large corpus in the English language only, to train two good language models. Combining these four models, we immediately get a good English-to-foreign translator and a good foreign-to-English translator. [3]

The cost of noisy-channel model is that using Bayesian inference is more costly than using a translation model directly. Instead of reading out the most likely translation by , it would have to read out predictions by both the translation model and the language model, multiply them, and search for the highest number.

In speech recognition

Speech recognition can be thought of as translating from a sound-language to a text-language. Consequently, we have

where is the probability that a speech sound S is produced if the speaker is intending to say text T. Intuitively, this equation states that the most likely text is a text that's both a likely text in the language, and produces the speech sound with high probability.

The utility of the noisy-channel model is not in capacity. Theoretically, any noisy-channel model can be replicated by a direct model. However, the noisy-channel model factors the model into two parts which are appropriate for the situation, and consequently it is generally more well-behaved.

When a human speaks, it does not produce the sound directly, but first produces the text it wants to speak in the language centers of the brain, then the text is translated into sound by the motor cortex, vocal cords, and other parts of the body. The noisy-channel model matches this model of the human, and so it is appropriate. This is justified in the practical success of noisy-channel model in speech recognition.

Example

Consider the sound-language sentence (written in IPA for English) S = aɪ wʊd laɪk wʌn tuː. There are three possible texts :

that are equally likely, in the sense that . With a good English language model, we would have , since the second sentence is grammatical, the first is not quite, but close to a grammatical one (such as "I would like one to [go]."), while the third one is far from grammatical.

Consequently, the noisy-channel model would output as the best transcription.

See also

Related Research Articles

In physics, the cross section is a measure of the probability that a specific process will take place when some kind of radiant excitation intersects a localized phenomenon. For example, the Rutherford cross-section is a measure of probability that an alpha particle will be deflected by a given angle during an interaction with an atomic nucleus. Cross section is typically denoted σ (sigma) and is expressed in units of area, more specifically in barns. In a way, it can be thought of as the size of the object that the excitation must hit in order for the process to occur, but more exactly, it is a parameter of a stochastic process.

<span class="mw-page-title-main">Entropy (information theory)</span> Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to :

<span class="mw-page-title-main">Normal distribution</span> Probability distribution

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

<span class="mw-page-title-main">Naive Bayes classifier</span> Probabilistic classification algorithm

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve high accuracy levels.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context free grammars (PCFGs) have been applied in probabilistic modeling of RNA structures almost 40 years after they were introduced in computational linguistics.

In control systems, sliding mode control (SMC) is a nonlinear control method that alters the dynamics of a nonlinear system by applying a discontinuous control signal that forces the system to "slide" along a cross-section of the system's normal behavior. The state-feedback control law is not a continuous function of time. Instead, it can switch from one continuous structure to another based on the current position in the state space. Hence, sliding mode control is a variable structure control method. The multiple control structures are designed so that trajectories always move toward an adjacent region with a different control structure, and so the ultimate trajectory will not exist entirely within one control structure. Instead, it will slide along the boundaries of the control structures. The motion of the system as it slides along these boundaries is called a sliding mode and the geometrical locus consisting of the boundaries is called the sliding (hyper)surface. In the context of modern control theory, any variable structure system, like a system under SMC, may be viewed as a special case of a hybrid dynamical system as the system both flows through a continuous state space but also moves through different discrete control modes.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

<span class="mw-page-title-main">Fermi's interaction</span> Mechanism of beta decay proposed in 1933

In particle physics, Fermi's interaction is an explanation of the beta decay, proposed by Enrico Fermi in 1933. The theory posits four fermions directly interacting with one another. This interaction explains beta decay of a neutron by direct coupling of a neutron with an electron, a neutrino and a proton.

A finite-state transducer (FST) is a finite-state machine with two memory tapes, following the terminology for Turing machines: an input tape and an output tape. This contrasts with an ordinary finite-state automaton, which has a single tape. An FST is a type of finite-state automaton (FSA) that maps between two sets of symbols. An FST is more general than an FSA. An FSA defines a formal language by defining a set of accepted strings, while an FST defines relations between sets of strings.

A variance swap is an over-the-counter financial derivative that allows one to speculate on or hedge risks associated with the magnitude of movement, i.e. volatility, of some underlying product, like an exchange rate, interest rate, or stock index.

<span class="mw-page-title-main">Ornstein–Uhlenbeck process</span> Stochastic process modeling random walk with friction

In mathematics, the Ornstein–Uhlenbeck process is a stochastic process with applications in financial mathematics and the physical sciences. Its original application in physics was as a model for the velocity of a massive Brownian particle under the influence of friction. It is named after Leonard Ornstein and George Eugene Uhlenbeck.

In mathematics, the Gibbs measure, named after Josiah Willard Gibbs, is a probability measure frequently seen in many problems of probability theory and statistical mechanics. It is a generalization of the canonical ensemble to infinite systems. The canonical ensemble gives the probability of the system X being in state x as

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935.

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value . Like this, over the generation sequence, individuals with better and better -values are generated.

In finance, a volatility swap is a forward contract on the future realised volatility of a given underlying asset. Volatility swaps allow investors to trade the volatility of an asset directly, much as they would trade a price index. Its payoff at expiration is equal to

Sliding window based part-of-speech tagging is used to part-of-speech tag a text.

In filtering theory the Kushner equation is an equation for the conditional probability density of the state of a stochastic non-linear dynamical system, given noisy measurements of the state. It therefore provides the solution of the nonlinear filtering problem in estimation theory. The equation is sometimes referred to as the Stratonovich–Kushnerequation. However, the correct equation in terms of Itō calculus was first derived by Kushner although a more heuristic Stratonovich version of it appeared already in Stratonovich's works in late fifties. However, the derivation in terms of Itō calculus is due to Richard Bucy.

Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory.

<span class="mw-page-title-main">Transformer (machine learning model)</span> Machine learning algorithm used for natural-language processing

A transformer is a deep learning architecture, initially proposed in 2017, that relies on the parallel multi-head attention mechanism. It is notable for requiring less training time than previous recurrent neural architectures, such as long short-term memory (LSTM), and its later variation has been prevalently adopted for training large language models on large (language) datasets, such as the Wikipedia corpus and Common Crawl, by virtue of the parallelized processing of input sequence. Input text is split into n-grams encoded as tokens and each token is converted into a vector via looking up from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism allowing the signal for key tokens to be amplified and less important tokens to be diminished. Though the transformer paper was published in 2017, the softmax-based attention mechanism was proposed in 2014 for machine translation, and the Fast Weight Controller, similar to a transformer, was proposed in 1992.

References

  1. Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2023. All rights reserved. Draft of January 7, 2023. https://web.stanford.edu/~jurafsky/slp3/B.pdf
  2. Jurafsky, Dan (2009). Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. James H. Martin (2nd ed.). Upper Saddle River, N.J. ISBN   978-0-13-187321-6. OCLC   213375806.{{cite book}}: CS1 maint: location missing publisher (link)
  3. Brown, Peter F.; Della Pietra, Stephen A.; Della Pietra, Vincent J.; Mercer, Robert L. (1993). Hirschberg, Julia (ed.). "The Mathematics of Statistical Machine Translation: Parameter Estimation". Computational Linguistics. 19 (2): 263–311.