Variable-order Markov model

Last updated January 03, 2024

In the mathematical theory of stochastic processes, variable-order Markov (VOM) models are an important class of models that extend the well known Markov chain models. In contrast to the Markov chain models, where each random variable in a sequence with a Markov property depends on a fixed number of random variables, in VOM models this number of conditioning random variables may vary based on the specific observed realization.

This realization sequence is often called the context; therefore the VOM models are also called context trees.^[1] VOM models are nicely rendered by colorized probabilistic suffix trees (PST).^[2] The flexibility in the number of conditioning random variables turns out to be of real advantage for many applications, such as statistical analysis, classification and prediction.^[3]^[4]^[5]

Example

Consider for example a sequence of random variables, each of which takes a value from the ternary alphabet ${a, b, c}$ . Specifically, consider the string constructed from infinite concatenations of the sub-string $aaabc$ : $aaabcaaabcaaabcaaabc\dotsaaabc$ .

In this example, $Pr(c | ab) = Pr(c | b) = 1.0$ ; therefore, the shorter context $b$ is sufficient to determine the next character. Similarly, the VOM model of maximal order 3 can generate the string exactly using only five conditional probability components, which are all equal to 1.0.

In practical settings there is seldom sufficient data to accurately estimate the exponentially increasing number of conditional probability components as the order of the Markov chain increases.

The variable-order Markov model assumes that in realistic settings, there are certain realizations of states (represented by contexts) in which some past states are independent from the future states; accordingly, "a great reduction in the number of model parameters can be achieved."^[1]

Definition

Let $A$ be a state space (finite alphabet) of size $|A|$ .

Consider a sequence with the Markov property $x_{1}^{n}=x_{1}x_{2}\dots x_{n}$ of $n$ realizations of random variables, where $x_{i}\in A$ is the state (symbol) at position $i$ $\scriptstyle (1\leq i\leq n)$ , and the concatenation of states $x_{i}$ and $x_{i+1}$ is denoted by $x_{i}x_{i+1}$ .

Given a training set of observed states, $x_{1}^{n}$ , the construction algorithm of the VOM models^[3]^[4]^[5] learns a model $P$ that provides a probability assignment for each state in the sequence given its past (previously observed symbols) or future states.

Specifically, the learner generates a conditional probability distribution $P(x_{i}\mid s)$ for a symbol $x_{i}\in A$ given a context $s\in A^{*}$ , where the * sign represents a sequence of states of any length, including the empty context.

VOM models attempt to estimate conditional distributions of the form $P(x_{i}\mid s)$ where the context length $|s|\leq D$ varies depending on the available statistics. In contrast, conventional Markov models attempt to estimate these conditional distributions by assuming a fixed contexts' length $|s|=D$ and, hence, can be considered as special cases of the VOM models.

Effectively, for a given training sequence, the VOM models are found to obtain better model parameterization than the fixed-order Markov models that leads to a better variance-bias tradeoff of the learned models.^[3]^[4]^[5]

Application areas

Various efficient algorithms have been devised for estimating the parameters of the VOM model.^[4]

VOM models have been successfully applied to areas such as machine learning, information theory and bioinformatics, including specific applications such as coding and data compression,^[1] document compression,^[4] classification and identification of DNA and protein sequences,^[6] [3] statistical process control,^[5] spam filtering,^[7] haplotyping,^[8] speech recognition,^[9] sequence analysis in social sciences,^[2] and others.

Related Research Articles

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable $, which takes values in the alphabet and is distributed according to :$

A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, "What happens next depends only on the state of affairs now." A countably infinite sequence, in which the chain moves state at discrete time steps, gives a discrete-time Markov chain (DTMC). A continuous-time process is called a continuous-time Markov chain (CTMC). It is named after the Russian mathematician Andrey Markov.

A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent Markov process. An HMM requires that there be an observable process $whose outcomes depend on the outcomes of in a known way. Since cannot be observed directly, the goal is to learn about state of by observing By definition of being a Markov model, an HMM has an additional requirement that the outcome of at time must be "influenced" exclusively by the outcome of at and that the outcomes of and at must be conditionally independent of at given at time Estimation of the parameters in an HMM can be performed using maximum likelihood. For linear chain HMMs, the Baum-Welch algorithm can be used to estimate the parameters.$

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

In probability theory and statistics, the term Markov property refers to the memoryless property of a stochastic process, which means that its future evolution is independent of its history. It is named after the Russian mathematician Andrey Markov. The term strong Markov property is similar to the Markov property, except that the meaning of "present" is defined in terms of a random variable known as a stopping time.

A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability theory, statistics—particularly Bayesian statistics—and machine learning.

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a specified multivariate probability distribution when direct sampling from the joint distribution is difficult, but sampling from the conditional distribution is more practical. This sequence can be used to approximate the joint distribution ; to approximate the marginal distribution of one of the variables, or some subset of the variables ; or to compute an integral. Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

A continuous-time Markov chain (CTMC) is a continuous stochastic process in which, for each state, the process will change state according to an exponential random variable and then move to a different state as specified by the probabilities of a stochastic matrix. An equivalent formulation describes the process as changing state according to the least value of a set of exponential random variables, one for each possible state it can move to, with the parameters determined by the current state.

<span class="mw-page-title-main">Markov random field</span> Set of random variables

In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. In other words, a random field is said to be a Markov random field if it satisfies Markov properties. The concept originates from the Sherrington–Kirkpatrick model.

In bioinformatics, GLIMMER (Gene Locator and Interpolated Markov ModelER) is used to find genes in prokaryotic DNA. "It is effective at finding genes in bacteria, archea, viruses, typically finding 98-99% of all relatively long protein coding genes". GLIMMER was the first system that used the interpolated Markov model to identify coding regions. The GLIMMER software is open source and is maintained by Steven Salzberg, Art Delcher, and their colleagues at the Center for Computational Biology at Johns Hopkins University. The original GLIMMER algorithms and software were designed by Art Delcher, Simon Kasif and Steven Salzberg and applied to bacterial genome annotation in collaboration with Owen White.

In probability, a discrete-time Markov chain (DTMC) is a sequence of random variables, known as a stochastic process, in which the value of the next variable depends only on the value of the current variable, and not any variables in the past. For instance, a machine may have two states, A and E. When it is in state A, there is a 40% chance of it moving to state E and a 60% chance of it remaining in state A. When it is in state E, there is a 70% chance of it moving to A and a 30% chance of it staying in E. The sequence of states of the machine is a Markov chain. If we denote the chain by $then is the state which the machine starts in and is the random variable describing its state after 10 transitions. The process continues forever, indexed by the natural numbers.$

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.

In probability theory, random element is a generalization of the concept of random variable to more complicated spaces than the simple real line. The concept was introduced by Maurice Fréchet (1948) who commented that the “development of probability theory and expansion of area of its applications have led to necessity to pass from schemes where (random) outcomes of experiments can be described by number or a finite set of numbers, to schemes where outcomes of experiments represent, for example, vectors, functions, processes, fields, series, transformations, and also sets or collections of sets.”

Markov renewal processes are a class of random processes in probability and statistics that generalize the class of Markov jump processes. Other classes of random processes, such as Markov chains and Poisson processes, can be derived as special cases among the class of Markov renewal processes, while Markov renewal processes are special cases among the more general class of renewal processes.

In the mathematical theory of probability, the entropy rate or source information rate is a function assigning an entropy to a stochastic process.

The Hammersley–Clifford theorem is a result in probability theory, mathematical statistics and statistical mechanics that gives necessary and sufficient conditions under which a strictly positive probability distribution can be represented as events generated by a Markov network. It is the fundamental theorem of random fields. It states that a probability distribution that has a strictly positive mass or density satisfies one of the Markov properties with respect to an undirected graph G if and only if it is a Gibbs random field, that is, its density can be factorized over the cliques of the graph.

In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it. Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. For this reason, in the fields of predictive modelling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property.

In probability theory, an additive Markov chain is a Markov chain with an additive conditional probability function. Here the process is a discrete-time Markov chain of order m and the transition probability to a state at the next time is a sum of functions, each depending on the next state and one of the m previous states.

Graphical models have become powerful frameworks for protein structure prediction, protein–protein interaction, and free energy calculations for protein structures. Using a graphical model to represent the protein structure allows the solution of many problems including secondary structure prediction, protein-protein interactions, protein-drug interaction, and free energy calculations.

Stochastic chains with memory of variable length are a family of stochastic chains of finite order in a finite alphabet, such as, for every time pass, only one finite suffix of the past, called context, is necessary to predict the next symbol. These models were introduced in the information theory literature by Jorma Rissanen in 1983, as a universal tool to data compression, but recently have been used to model data in different areas such as biology, linguistics and music.

References

1 2 3 Rissanen, J. (Sep 1983). "A Universal Data Compression System". IEEE Transactions on Information Theory. 29 (5): 656–664. doi:10.1109/TIT.1983.1056741.
1 2 Gabadinho, Alexis; Ritschard, Gilbert (2016). "Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package". Journal of Statistical Software. 72 (3). doi: 10.18637/jss.v072.i03 . ISSN 1548-7660. S2CID 63681202.
1 2 3 4 Shmilovici, A.; Ben-Gal, I. (2007). "Using a VOM Model for Reconstructing Potential Coding Regions in EST Sequences". Computational Statistics. 22 (1): 49–69. doi:10.1007/s00180-007-0021-8. S2CID 2737235.
1 2 3 4 5 Begleiter, R.; El-Yaniv, R.; Yona, G. (2004). "On Prediction Using Variable Order Markov models". Journal of Artificial Intelligence Research. 22: 385–421. arXiv: 1107.0051 . doi: 10.1613/jair.1491 .
1 2 3 4 Ben-Gal, I.; Morag, G.; Shmilovici, A. (2003). "Context-Based Statistical Process Control: A Monitoring Procedure for State-Dependent Processes" (PDF). Technometrics. 45 (4): 293–311. doi:10.1198/004017003000000122. ISSN 0040-1706. S2CID 5227793.
↑ Grau J.; Ben-Gal I.; Posch S.; Grosse I. (2006). "VOMBAT: Prediction of Transcription Factor Binding Sites using Variable Order Bayesian Trees" (PDF). Nucleic Acids Research. Nucleic Acids Research, vol. 34, issue W529–W533. 34 (Web Server issue): W529-33. doi:10.1093/nar/gkl212. PMC 1538886 . PMID 16845064.
↑ Bratko, A.; Cormack, G. V.; Filipic, B.; Lynam, T.; Zupan, B. (2006). "Spam Filtering Using Statistical Data Compression Models" (PDF). Journal of Machine Learning Research. 7: 2673–2698.
↑ Browning, Sharon R. "Multilocus association mapping using variable-length Markov chains." The American Journal of Human Genetics 78.6 (2006): 903–913.
↑ Smith, A.; Denenberg, J.; Slack, T.; Tan, C.; Wohlford, R. (1985). "Application of a sequential pattern learning system to connected speech recognition". ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 10. Tampa, FL, USA: Institute of Electrical and Electronics Engineers. pp. 1201–1204. doi:10.1109/ICASSP.1985.1168282. S2CID 60991068.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Rissanen-1] 1 2 3 Rissanen, J. (Sep 1983). "A Universal Data Compression System". IEEE Transactions on Information Theory. 29 (5): 656–664. doi:10.1109/TIT.1983.1056741.

[:0-2] 1 2 Gabadinho, Alexis; Ritschard, Gilbert (2016). "Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package". Journal of Statistical Software. 72 (3). doi: 10.18637/jss.v072.i03 . ISSN 1548-7660. S2CID 63681202.

[Shmilovici-3] 1 2 3 4 Shmilovici, A.; Ben-Gal, I. (2007). "Using a VOM Model for Reconstructing Potential Coding Regions in EST Sequences". Computational Statistics. 22 (1): 49–69. doi:10.1007/s00180-007-0021-8. S2CID 2737235.

[Begleiter-4] 1 2 3 4 5 Begleiter, R.; El-Yaniv, R.; Yona, G. (2004). "On Prediction Using Variable Order Markov models". Journal of Artificial Intelligence Research. 22: 385–421. arXiv: 1107.0051 . doi: 10.1613/jair.1491 .

[Ben-Gal-5] 1 2 3 4 Ben-Gal, I.; Morag, G.; Shmilovici, A. (2003). "Context-Based Statistical Process Control: A Monitoring Procedure for State-Dependent Processes" (PDF). Technometrics. 45 (4): 293–311. doi:10.1198/004017003000000122. ISSN 0040-1706. S2CID 5227793.

[6] Grau J.; Ben-Gal I.; Posch S.; Grosse I. (2006). "VOMBAT: Prediction of Transcription Factor Binding Sites using Variable Order Bayesian Trees" (PDF). Nucleic Acids Research. Nucleic Acids Research, vol. 34, issue W529–W533. 34 (Web Server issue): W529-33. doi:10.1093/nar/gkl212. PMC 1538886 . PMID 16845064.

[Bratko-7] Bratko, A.; Cormack, G. V.; Filipic, B.; Lynam, T.; Zupan, B. (2006). "Spam Filtering Using Statistical Data Compression Models" (PDF). Journal of Machine Learning Research. 7: 2673–2698.

[8] Browning, Sharon R. "Multilocus association mapping using variable-length Markov chains." The American Journal of Human Genetics 78.6 (2006): 903–913.

[9] Smith, A.; Denenberg, J.; Slack, T.; Tan, C.; Wohlford, R. (1985). "Application of a sequential pattern learning system to connected speech recognition". ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 10. Tampa, FL, USA: Institute of Electrical and Electronics Engineers. pp. 1201–1204. doi:10.1109/ICASSP.1985.1168282. S2CID 60991068.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]