Forward algorithm

The forward algorithm, in the context of a hidden Markov model (HMM), is used to calculate a 'belief state': the probability of a state at a certain time, given the history of evidence. The process is also known as filtering. The forward algorithm is closely related to, but distinct from, the Viterbi algorithm.

Introduction

The forward and backward algorithms should be placed within the context of probability, as they are essentially names given to a set of standard mathematical procedures used within several fields. For example, neither "forward algorithm" nor "Viterbi" appears in the Cambridge Encyclopedia of Mathematics. The main observation to take away from these algorithms is how to organize Bayesian updates and inference so as to be computationally efficient in the context of directed graphs of variables (see sum-product networks).

For an HMM such as this one:

[Figure: temporal evolution of a hidden Markov model]

this probability is written as $p(x_t \mid y_{1:t})$. Here $x(t)$ is the hidden state, which is abbreviated as $x_t$, and $y_{1:t}$ are the observations $1$ to $t$.

The backward algorithm complements the forward algorithm by taking into account the future history if one wanted to improve the estimate for past times. This is referred to as smoothing, and the forward/backward algorithm computes $p(x_t \mid y_{1:T})$ for $1 < t < T$. Thus, the full forward/backward algorithm takes into account all evidence. Note that a belief state can be calculated at each time step, but doing this does not, in a strict sense, produce the most likely state sequence, but rather the most likely state at each time step, given the previous history. In order to achieve the most likely sequence, the Viterbi algorithm is required. It computes the most likely state sequence given the history of observations, that is, the state sequence that maximizes $p(x_{0:t} \mid y_{0:t})$.

Algorithm

The goal of the forward algorithm is to compute the joint probability $p(x_t, y_{1:t})$, where for notational convenience we have abbreviated $x(t)$ as $x_t$ and $(y(1), y(2), \ldots, y(t))$ as $y_{1:t}$. Once the joint probability $p(x_t, y_{1:t})$ is computed, the other probabilities $p(x_t \mid y_{1:t})$ and $p(y_{1:t})$ are easily obtained.

Both the state $x_t$ and observation $y_t$ are assumed to be discrete, finite random variables. The hidden Markov model's state transition probabilities $p(x_t \mid x_{t-1})$, observation/emission probabilities $p(y_t \mid x_t)$, and initial prior probability $p(x_0)$ are assumed to be known. Furthermore, the sequence of observations $y_{1:t}$ is assumed to be given.

Computing $p(x_t, y_{1:t})$ naively would require marginalizing over all possible state sequences $\{x_{1:t-1}\}$, the number of which grows exponentially with $t$. Instead, the forward algorithm takes advantage of the conditional independence rules of the hidden Markov model (HMM) to perform the calculation recursively.

To demonstrate the recursion, let

$$\alpha_t(x_t) = p(x_t, y_{1:t}) = \sum_{x_{t-1}} p(x_t, x_{t-1}, y_{1:t}).$$

Using the chain rule to expand $p(x_t, x_{t-1}, y_{1:t})$, we can then write

$$\alpha_t(x_t) = \sum_{x_{t-1}} p(y_t \mid x_t, x_{t-1}, y_{1:t-1})\, p(x_t \mid x_{t-1}, y_{1:t-1})\, p(x_{t-1}, y_{1:t-1}).$$

Because $y_t$ is conditionally independent of everything but $x_t$, and $x_t$ is conditionally independent of everything but $x_{t-1}$, this simplifies to

$$\alpha_t(x_t) = p(y_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1})\, \alpha_{t-1}(x_{t-1}).$$

Thus, since $p(y_t \mid x_t)$ and $p(x_t \mid x_{t-1})$ are given by the model's emission distributions and transition probabilities, which are assumed to be known, one can quickly calculate $\alpha_t(x_t)$ from $\alpha_{t-1}(x_{t-1})$ and avoid incurring exponential computation time.
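
As a concrete illustration, one step of this recursion can be sketched in a few lines of Python. This is a minimal sketch under assumed conventions: the dictionary-based probability tables and the function name forward_step are illustrative, not from the original article.

  # One step of the forward recursion:
  # alpha_t(x_t) = p(y_t | x_t) * sum over x_prev of p(x_t | x_prev) * alpha_prev(x_prev).
  # 'trans' and 'emit' are assumed probability tables keyed by (state, state)
  # and (state, observation) pairs respectively.
  def forward_step(alpha_prev, y_t, states, trans, emit):
      alpha = {}
      for x_t in states:
          # Propagate mass from every previous state through the transition model...
          total = sum(trans[(x_prev, x_t)] * alpha_prev[x_prev] for x_prev in states)
          # ...then weight by the probability of emitting the current observation.
          alpha[x_t] = emit[(x_t, y_t)] * total
      return alpha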

The recursion formula given above can be written in a more compact form. Let $a_{ij} = p(x_t = i \mid x_{t-1} = j)$ be the transition probabilities and $b_{ij} = p(y_i \mid x_j)$ be the emission probabilities, then

$$\vec{\alpha}_t = \mathbf{b}_{y_t}^{\mathsf{T}} \odot \mathbf{A}\,\vec{\alpha}_{t-1}$$

where $\mathbf{A} = [a_{ij}]$ is the transition probability matrix, $\mathbf{b}_{y_t}$ is the row of the emission probability matrix $\mathbf{B} = [b_{ij}]$ which corresponds to the actual observation $y_t$ at time $t$, and $\vec{\alpha}_t$ is the alpha vector. The $\odot$ is the Hadamard product between the transpose of $\mathbf{b}_{y_t}$ and $\mathbf{A}\vec{\alpha}_{t-1}$.

The initial condition is set in accordance with the prior probability over $x_0$ as

$$\alpha_0(x_0) = p(y_0, x_0) = p(y_0 \mid x_0)\, p(x_0).$$
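
In NumPy, the matrix form above, including the initial condition, can be sketched as follows. The conventions here are assumptions for illustration: A[i, j] = p(x_t = i | x_{t-1} = j) so columns sum to one, B[i, k] = p(y = k | x = i), and pi is the prior p(x_0).

  import numpy as np

  def forward_init(pi, B, y0):
      # alpha_0(x_0) = p(y_0 | x_0) p(x_0): fold the first observation into the prior.
      return B[:, y0] * pi

  def forward_update(alpha_prev, A, B, y_t):
      # alpha_t = b_{y_t} (Hadamard) (A alpha_{t-1}): a matrix-vector product
      # followed by an element-wise product with the relevant emission column.
      return B[:, y_t] * (A @ alpha_prev)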

Once the joint probability $\alpha_t(x_t) = p(x_t, y_{1:t})$ has been computed using the forward algorithm, we can easily obtain the related joint probability $p(y_{1:t})$ as

$$p(y_{1:t}) = \sum_{x_t} \alpha_t(x_t)$$

and the required conditional probability $p(x_t \mid y_{1:t})$ as

$$p(x_t \mid y_{1:t}) = \frac{\alpha_t(x_t)}{\sum_{x_t} \alpha_t(x_t)}.$$

Once the conditional probability has been calculated, we can also find the point estimate of $x_t$. For instance, the MAP estimate of $x_t$ is given by

$$\widehat{x}_t^{\text{MAP}} = \arg\max_{x_t} p(x_t \mid y_{1:t}),$$

while the MMSE estimate of $x_t$ is given by

$$\widehat{x}_t^{\text{MMSE}} = \mathbb{E}[x_t \mid y_{1:t}] = \sum_{x_t} x_t\, p(x_t \mid y_{1:t}).$$
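
Given an alpha vector, these quantities are one line each in NumPy. A sketch under the same assumed conventions as above; note that the MMSE estimate only makes sense when the states carry numeric values, supplied here as a hypothetical state_values array.

  import numpy as np

  def filtered_posterior(alpha):
      # p(x_t | y_{1:t}) = alpha_t(x_t) / sum over x_t of alpha_t(x_t)
      return alpha / alpha.sum()

  def map_estimate(alpha):
      # Index of the state maximizing the filtered posterior.
      return int(np.argmax(alpha))

  def mmse_estimate(alpha, state_values):
      # E[x_t | y_{1:t}] = sum over x_t of x_t * p(x_t | y_{1:t});
      # requires states to have numeric values.
      return float(state_values @ filtered_posterior(alpha))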

The forward algorithm is easily modified to account for observations from variants of the hidden Markov model as well, such as the Markov jump linear system.

Pseudocode

  1. Initialize $t = 0$, with:
    transition probabilities $p(x_t \mid x_{t-1})$,
    emission probabilities $p(y_t \mid x_t)$,
    observed sequence $y_{1:T}$,
    prior probability $\alpha_0(x_0) = p(y_0, x_0) = p(y_0 \mid x_0)\, p(x_0)$
  2. For $t = 1$ to $T$:
    $\alpha_t(x_t) = p(y_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1})\, \alpha_{t-1}(x_{t-1})$
  3. Return $p(x_T \mid y_{1:T}) = \alpha_T(x_T) \big/ \sum_{x_T} \alpha_T(x_T)$
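
The pseudocode translates directly into a runnable routine. The sketch below follows the matrix conventions assumed earlier and additionally rescales the alpha vector at every step; this rescaling is a standard numerical precaution (not part of the pseudocode above) that prevents floating-point underflow on long sequences while recovering the log-likelihood from the scaling factors.

  import numpy as np

  def forward(pi, A, B, obs):
      """Forward algorithm with per-step rescaling.

      pi  : (m,)   prior p(x_0)
      A   : (m, m) transitions, A[i, j] = p(x_t = i | x_{t-1} = j)
      B   : (m, k) emissions,   B[i, y] = p(y_t = y | x_t = i)
      obs : observation indices y_0, ..., y_T

      Returns the filtered posterior over the final state and log p(y_{0:T}).
      """
      alpha = B[:, obs[0]] * pi           # alpha_0(x_0) = p(y_0 | x_0) p(x_0)
      c = alpha.sum()
      log_likelihood = np.log(c)
      alpha = alpha / c                   # rescale so alpha sums to one
      for y in obs[1:]:
          alpha = B[:, y] * (A @ alpha)   # the recursion from the derivation
          c = alpha.sum()                 # c_t = p(y_t | y_{0:t-1})
          log_likelihood += np.log(c)
          alpha = alpha / c
      return alpha, log_likelihood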

Example

This example concerns inferring the states of the weather from the observed condition of seaweed. We have observations of seaweed for three consecutive days: dry, damp, and soggy, in that order. The possible states of weather are sunny, cloudy, and rainy, so in total there can be $3^3 = 27$ such weather sequences. Exploring all of these possible state sequences is computationally very expensive. To reduce this complexity, the forward algorithm comes in handy; the trick lies in using the conditional independence of the sequence steps to calculate partial probabilities $\alpha_t(x_t)$, as shown in the derivation above. Hence, we can calculate the probabilities as the product of the appropriate observation/emission probability $p(y_t \mid x_t)$ (the probability of the observation seen at time $t$ given the state) with the sum of the probabilities of reaching that state at time $t$, calculated using the transition probabilities. This reduces the complexity of the problem from searching the whole search space to just using the previously computed $\alpha$'s and the transition probabilities.
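
To make the example concrete, the sketch below runs the recursion on the three seaweed observations. Every number in the probability tables is invented for illustration; the article does not specify them.

  import numpy as np

  states = ["sunny", "cloudy", "rainy"]      # hidden weather states
  observations = ["dry", "damp", "soggy"]    # observable seaweed conditions

  pi = np.array([0.6, 0.3, 0.1])             # assumed prior over day-0 weather
  A = np.array([[0.7, 0.3, 0.2],             # A[i, j] = p(weather i | weather j yesterday)
                [0.2, 0.4, 0.3],
                [0.1, 0.3, 0.5]])
  B = np.array([[0.6, 0.3, 0.1],             # B[i, k] = p(seaweed condition k | weather i)
                [0.3, 0.4, 0.3],
                [0.1, 0.3, 0.6]])

  obs = [0, 1, 2]                            # dry, damp, soggy on three consecutive days

  alpha = B[:, obs[0]] * pi                  # initialization with the first observation
  for y in obs[1:]:
      alpha = B[:, y] * (A @ alpha)          # forward recursion

  print(dict(zip(states, alpha)))                  # joint p(final state, all observations)
  print(dict(zip(states, alpha / alpha.sum())))    # filtered posterior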

Complexity

The complexity of the forward algorithm is $\Theta(nm^2)$, where $m$ is the number of hidden or latent states (like the weather in the example above) and $n$ is the length of the sequence of the observed variable. This is a clear reduction from the ad hoc method of exploring all possible state sequences, which has a complexity of $\Theta(nm^n)$.
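
The claimed reduction is easy to sanity-check: summing the joint probability over every one of the $m^n$ state sequences must give the same observation probability as the recursion. A small self-contained sketch with invented two-state tables:

  import itertools
  import numpy as np

  def forward_joint(pi, A, B, obs):
      # Theta(n m^2): the forward recursion; returns p(y_{0:n-1}).
      alpha = B[:, obs[0]] * pi
      for y in obs[1:]:
          alpha = B[:, y] * (A @ alpha)
      return alpha.sum()

  def brute_force_joint(pi, A, B, obs):
      # Theta(n m^n): enumerate every possible hidden state sequence.
      m, total = len(pi), 0.0
      for seq in itertools.product(range(m), repeat=len(obs)):
          p = pi[seq[0]] * B[seq[0], obs[0]]
          for t in range(1, len(obs)):
              p *= A[seq[t], seq[t - 1]] * B[seq[t], obs[t]]
          total += p
      return total

  pi = np.array([0.5, 0.5])                  # invented numbers for the check
  A = np.array([[0.9, 0.2], [0.1, 0.8]])     # columns sum to one
  B = np.array([[0.8, 0.2], [0.3, 0.7]])     # rows sum to one
  obs = [0, 1, 0, 0]
  assert np.isclose(forward_joint(pi, A, B, obs), brute_force_joint(pi, A, B, obs))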

Variants of the algorithm

Variants of the forward algorithm have been developed for related settings, including a hybrid forward algorithm for RBF neural network construction, [1] an improved forward algorithm for optimal control of a class of hybrid systems, [2] and a continuous forward algorithm for RBF neural modelling. [3]

History

The forward algorithm is one of the algorithms used to solve the decoding problem for HMMs. With the development of speech recognition, [4] pattern recognition, and related fields such as computational biology that use HMMs, the forward algorithm has gained popularity.

Applications

The forward algorithm is mostly used in applications that require determining the probability of being in a specific state given a known sequence of observations. The algorithm can be applied wherever we can train a model as we receive data, using Baum–Welch [5] or any general EM algorithm; the forward algorithm will then tell us about the probability of the data with respect to what is expected from our model. One application is in the domain of finance, where it can help decide when to buy or sell tangible assets. It can have applications in all fields where hidden Markov models are applied; popular examples include natural language processing tasks such as part-of-speech tagging and speech recognition. [4] Recently it has also been used in the domain of bioinformatics. The forward algorithm can also be applied to weather speculation: we can build an HMM describing the weather and its relation to the observed conditions over a few consecutive days (some examples could be dry, damp, soggy, sunny, cloudy, rainy, etc.), recursively calculate the probability of observing any sequence of observations given the HMM, and then calculate the probability of reaching an intermediate state as the sum over all possible paths to that state. Thus, the partial probabilities for the final observation hold the probability of reaching those states going through all possible paths.

See also

Related Research Articles

Markov chain

A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, "What happens next depends only on the state of affairs now." A countably infinite sequence, in which the chain moves state at discrete time steps, gives a discrete-time Markov chain (DTMC). A continuous-time process is called a continuous-time Markov chain (CTMC). It is named after the Russian mathematician Andrey Markov.

A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent Markov process $X$. An HMM requires that there be an observable process $Y$ whose outcomes depend on the outcomes of $X$ in a known way. Since $X$ cannot be observed directly, the goal is to learn about the state of $X$ by observing $Y$. By definition of being a Markov model, an HMM has an additional requirement that the outcome of $Y$ at time $t = t_0$ must be "influenced" exclusively by the outcome of $X$ at $t = t_0$, and that the outcomes of $X$ and $Y$ at $t < t_0$ must be conditionally independent of $Y$ at $t = t_0$ given $X$ at time $t = t_0$. Estimation of the parameters in an HMM can be performed using maximum likelihood. For linear chain HMMs, the Baum–Welch algorithm can be used to estimate the parameters.

The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events. This is done especially in the context of Markov information sources and hidden Markov models (HMM).

Expectation–maximization algorithm

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. It can be used, for example, to estimate a mixture of Gaussians, or to solve the multiple linear regression problem.

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a specified multivariate probability distribution when direct sampling from the joint distribution is difficult, but sampling from the conditional distribution is more practical. This sequence can be used to approximate the joint distribution; to approximate the marginal distribution of one of the variables, or some subset of the variables; or to compute an integral. Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step.

A continuous-time Markov chain (CTMC) is a continuous stochastic process in which, for each state, the process will change state according to an exponential random variable and then move to a different state as specified by the probabilities of a stochastic matrix. An equivalent formulation describes the process as changing state according to the least value of a set of exponential random variables, one for each possible state it can move to, with the parameters determined by the current state.

In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):

  1. A generative model is a statistical model of the joint probability distribution $P(X, Y)$ on a given observable variable X and target variable Y;
  2. A discriminative model is a model of the conditional probability $P(Y \mid X = x)$ of the target Y, given an observation x; and
  3. Classifiers computed without using a probability model are also referred to loosely as "discriminative".

In probability theory, statistics, and machine learning, recursive Bayesian estimation, also known as a Bayes filter, is a general probabilistic approach for estimating an unknown probability density function (PDF) recursively over time using incoming measurements and a mathematical process model. The process relies heavily upon mathematical concepts and models that are theorized within a study of prior and posterior probabilities known as Bayesian statistics.

A phase-type distribution is a probability distribution constructed by a convolution or mixture of exponential distributions. It results from a system of one or more inter-related Poisson processes occurring in sequence, or phases. The sequence in which each of the phases occurs may itself be a stochastic process. The distribution can be represented by a random variable describing the time until absorption of a Markov process with one absorbing state. Each of the states of the Markov process represents one of the phases.

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.

In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution. It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector $\boldsymbol{\alpha}$, and an observation drawn from a multinomial distribution with probability vector p and number of trials n. The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data is collected. The compounding corresponds to a Pólya urn scheme. It is frequently encountered in Bayesian statistics, machine learning, empirical Bayes methods and classical statistics as an overdispersed multinomial distribution.

The forward–backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations/emissions $y_{1:T} := y_1, \ldots, y_T$, i.e. it computes, for all hidden state variables $X_t \in \{X_1, \ldots, X_T\}$, the distribution $P(X_t \mid y_{1:T})$. This inference task is usually called smoothing. The algorithm makes use of the principle of dynamic programming to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name forward–backward algorithm.

In the field of mathematical modeling, a radial basis function network is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters. Radial basis function networks have many uses, including function approximation, time series prediction, classification, and system control. They were first formulated in a 1988 paper by Broomhead and Lowe, both researchers at the Royal Signals and Radar Establishment.

In statistics, a maximum-entropy Markov model (MEMM), or conditional Markov model (CMM), is a graphical model for sequence labeling that combines features of hidden Markov models (HMMs) and maximum entropy (MaxEnt) models. An MEMM is a discriminative model that extends a standard maximum entropy classifier by assuming that the unknown values to be learnt are connected in a Markov chain rather than being conditionally independent of each other. MEMMs find applications in natural language processing, specifically in part-of-speech tagging and information extraction.

Diffusion map

Diffusion maps is a dimensionality reduction or feature extraction algorithm introduced by Coifman and Lafon which computes a family of embeddings of a data set into Euclidean space whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the "diffusion distance" between probability distributions centered at those points. Different from linear dimensionality reduction methods such as principal component analysis (PCA), diffusion maps are part of the family of nonlinear dimensionality reduction methods which focus on discovering the underlying manifold that the data has been sampled from. By integrating local similarities at different scales, diffusion maps give a global description of the data-set. Compared with other methods, the diffusion map algorithm is robust to noise perturbation and computationally inexpensive.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in $\mathbb{R}^d$, discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song, Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in the literature.

IBM alignment models are a sequence of increasingly complex models used in statistical machine translation to train a translation model and an alignment model, starting with lexical translation probabilities and moving to reordering and word duplication. They underpinned the majority of statistical machine translation systems for almost twenty years starting in the early 1990s, until neural machine translation began to dominate. These models offer principled probabilistic formulation and (mostly) tractable inference.

Dependency networks (DNs) are graphical models, similar to Markov networks, wherein each vertex (node) corresponds to a random variable and each edge captures dependencies among variables. Unlike Bayesian networks, DNs may contain cycles. Each node is associated to a conditional probability table, which determines the realization of the random variable given its parents.

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process that generates a probability distribution for a given dataset from which we can then sample new images. They learn the latent structure of a dataset by modeling the way in which data points diffuse through their latent space.

References

  1. Peng, Jian-Xun, Kang Li, and De-Shuang Huang. "A hybrid forward algorithm for RBF neural network construction." IEEE Transactions on Neural Networks 17.6 (2006): 1439–1451.
  2. Zhang, Ping, and Christos G. Cassandras. "An improved forward algorithm for optimal control of a class of hybrid systems." IEEE Transactions on Automatic Control 47.10 (2002): 1735–1739.
  3. Peng, Jian-Xun, Kang Li, and George W. Irwin. "A novel continuous forward algorithm for RBF neural modelling." IEEE Transactions on Automatic Control 52.1 (2007): 117–122.
  4. Rabiner, Lawrence R. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77.2 (1989): 257–286. doi:10.1109/5.18626.
  5. Zhang, Yanxue, Dongmei Zhao, and Jinxing Liu. "The Application of Baum-Welch Algorithm in Multistep Attack." The Scientific World Journal 2014.

Further reading

Software