Stochastic chains with memory of variable length

Last updated April 02, 2024

Stochastic chains with memory of variable length are a family of stochastic chains of finite order on a finite alphabet such that, at each time step, only a finite suffix of the past, called the context, is needed to predict the next symbol. These models were introduced in the information theory literature by Jorma Rissanen in 1983,^[1] as a universal tool for data compression, and have more recently been used to model data in areas such as biology,^[2] linguistics^[3] and music.^[4]

Definition

A stochastic chain with memory of variable length is a stochastic chain $(X_{n})_{n\in \mathbb {Z} }$, taking values in a finite alphabet $A$, and characterized by a probabilistic context tree $(\tau ,p)$, such that

$\tau$ is the set of all contexts. A context $X_{n-l},\ldots ,X_{n-1}$, where $l$ is the length of the context, is a finite portion of the past $X_{-\infty },\ldots ,X_{n-1}$ that is relevant to predict the next symbol $X_{n}$;
$p$ is a family of transition probabilities associated with each context.
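
The definition can be illustrated with a minimal Python sketch. The tree `EXAMPLE_TREE`, its probabilities and the helper names `find_context` and `next_symbol_probability` are hypothetical and chosen only for illustration; contexts are stored in time order, oldest symbol first. The sketch shows how the relevant context is found by scanning suffixes of the past from shortest to longest.

```python
from typing import Dict, Sequence, Tuple

Context = Tuple[int, ...]
ProbTree = Dict[Context, Dict[int, float]]

# Hypothetical probabilistic context tree on the alphabet A = {0, 1}.
# Each context is written in time order (oldest symbol first) and no
# context is a suffix of another. The probabilities are illustrative.
EXAMPLE_TREE: ProbTree = {
    (1,): {0: 0.3, 1: 0.7},
    (1, 0): {0: 0.6, 1: 0.4},
    (0, 0): {0: 0.9, 1: 0.1},
}


def find_context(past: Sequence[int], tree: ProbTree) -> Context:
    """Return the suffix of `past` that is a context of `tree`."""
    for length in range(1, len(past) + 1):
        suffix = tuple(past[-length:])
        if suffix in tree:
            return suffix
    raise KeyError("no suffix of the past is a context of the tree")


def next_symbol_probability(past: Sequence[int], symbol: int, tree: ProbTree) -> float:
    """Probability of `symbol` given the past, read off the matching context."""
    return tree[find_context(past, tree)][symbol]
```

For the past `[0, 1, 1, 0]` only the last two symbols matter in this example tree, since `(1, 0)` is the matching context.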

History

The class of stochastic chains with memory of variable length was introduced by Jorma Rissanen in the article A universal data compression system.^[1] This class of stochastic chains was popularized in the statistical and probabilistic community by P. Bühlmann and A. J. Wyner in 1999, in the article Variable Length Markov Chains. Named “variable length Markov chains” (VLMC) by Bühlmann and Wyner, these chains are also known as “variable-order Markov models” (VOM), “probabilistic suffix trees”^[2] and “context tree models”.^[5] The name “stochastic chains with memory of variable length” seems to have been introduced by Galves and Löcherbach, in 2008, in the article of the same name.^[6]

Examples

Interrupted light source

Consider a system consisting of a lamp, an observer and a door between them. The lamp has two possible states: on, represented by 1, or off, represented by 0. When the lamp is on, the observer may see the light through the door, depending on the state of the door at that time: open, 1, or closed, 0. The states of the door are independent of the state of the lamp.

Let $(X_{n})_{n\geq 0}$ be a Markov chain that represents the state of the lamp, with values in $A=\{0,1\}$, and let $p$ be its transition probability matrix. Also, let $(\xi _{n})_{n\geq 0}$ be a sequence of independent random variables that represents the door's states, also taking values in $A$, independent of the chain $(X_{n})_{n\geq 0}$ and such that

\mathbb {P} (\xi _{n}=1)=1-\varepsilon

where $0<\varepsilon <1$. Define a new sequence $(Z_{n})_{n\geq 0}$ such that

Z_{n}=X_{n}\xi _{n}

for every $n\geq 0$.

To predict the next state of this sequence, it suffices to determine the last instant at which the observer could see the lamp on, i.e. to identify the largest instant $k$, with $k<n$, at which $Z_{k}=1$.

A context tree can be used to represent the past states of the sequence, showing which of them are relevant to identify the next state.

The stochastic chain $(Z_{n})_{n\in \mathbb {Z} }$ is, then, a chain with memory of variable length, taking values in $A$ and compatible with the probabilistic context tree $(\tau ,p)$ , where

\tau =\{1,10,100,\cdots \}\cup \{0^{\infty }\}.
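
The example can be simulated with the short Python sketch below. The function names `simulate_light` and `light_context`, the uniform initial state of the lamp and the numerical values in the usage note are assumptions made for illustration. The second function returns the context of a finite past, that is, the suffix starting at the most recent 1; when no 1 has been observed, the whole past is returned as a finite stand-in for the infinite context $0^{\infty }$.

```python
import random
from typing import List, Sequence, Tuple


def simulate_light(n: int, p: Sequence[Sequence[float]], eps: float, seed: int = 0) -> List[int]:
    """Simulate Z_k = X_k * xi_k for k = 0, ..., n-1.

    p[x][y] is the probability that the lamp moves from state x to state y,
    and eps is the probability that the door is closed at a given time.
    """
    rng = random.Random(seed)
    x = rng.randrange(2)  # assumption: the initial lamp state is uniform on {0, 1}
    z = []
    for _ in range(n):
        xi = 1 if rng.random() < 1 - eps else 0  # door open with probability 1 - eps
        z.append(x * xi)
        x = 1 if rng.random() < p[x][1] else 0  # one Markov step of the lamp
    return z


def light_context(past: Sequence[int]) -> Tuple[int, ...]:
    """Suffix of `past` starting at the most recent 1 (time order, oldest first)."""
    for k in range(len(past) - 1, -1, -1):
        if past[k] == 1:
            return tuple(past[k:])
    return tuple(past)  # no 1 seen: finite stand-in for the context 0^infinity
```

For instance, `light_context([0, 1, 0, 0])` gives `(1, 0, 0)`, which corresponds to the context written 100 in $\tau$, and `simulate_light(200, [[0.7, 0.3], [0.2, 0.8]], 0.1)` produces a sample of length 200 from one hypothetical choice of parameters.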

Inferences in chains with variable length

Given a sample $X_{l},\ldots ,X_{n}$, one can find the appropriate context tree using the following algorithms.

The context algorithm

In the article A Universal Data Compression System,^[1] Rissanen introduced a consistent algorithm to estimate the probabilistic context tree that generates the data. This algorithm's function can be summarized in two steps:

Given the sample produced by a chain with memory of variable length, start with the maximal tree whose branches are all the candidate contexts for the sample;
The branches of this tree are then pruned until the smallest tree that is well adapted to the data is obtained. The decision whether to shorten a context is made by means of a gain function, such as the log-likelihood ratio.

Let $X_{0},\ldots ,X_{n-1}$ be a sample from a finite probabilistic context tree $(\tau ,p)$. For any sequence $x_{-j}^{-1}$ with $j\leq n$, denote by $N_{n}(x_{-j}^{-1})$ the number of occurrences of the sequence in the sample, i.e.,

N_{n}(x_{-j}^{-1})=\sum _{t=0}^{n-j}\mathbf {1} \left\{X_{t}^{t+j-1}=x_{-j}^{-1}\right\}
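
As a small illustration, the counting function can be written directly from this formula. The name `count_occurrences` is hypothetical; the sample is a Python sequence and the word is a tuple of symbols in time order.

```python
from typing import Sequence, Tuple


def count_occurrences(sample: Sequence[int], word: Tuple[int, ...]) -> int:
    """N_n(word): number of positions t = 0, ..., n-j where `word` occurs in `sample`."""
    n, j = len(sample), len(word)
    # range(n - j + 1) is empty when the word is longer than the sample.
    return sum(1 for t in range(n - j + 1) if tuple(sample[t:t + j]) == word)
```

Overlapping occurrences are counted separately, as in the sum above.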

Rissanen first built a maximum candidate context, given by $X_{n-K(n)}^{n-1}$, where $K(n)=C\log {n}$ and $C$ is an arbitrary positive constant. The intuitive reason for the choice of $C\log {n}$ is that it is impossible to estimate the probabilities of sequences of length greater than $\log {n}$ from a sample of size $n$.

From there, Rissanen shortens the maximum candidate by successively cutting branches according to a sequence of tests based on the statistical likelihood ratio. More formally, if $\sum _{b\in A}N_{n}(x_{-k}^{-1}b)>0$, define the estimator of the transition probability $p$ by

{\hat {p}}_{n}(a\mid x_{-k}^{-1})={\frac {N_{n}(x_{-k}^{-1}a)}{\sum _{b\in A}N_{n}(x_{-k}^{-1}b)}}

where $x_{-k}^{-1}a=(x_{-k},\ldots ,x_{-1},a)$. If $\sum _{b\in A}N_{n}(x_{-k}^{-1}b)\,=\,0$, define ${\hat {p}}_{n}(a\mid x_{-k}^{-1})\,=\,1/|A|$.
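
The estimator, including the uniform fallback for contexts that never occur, can be sketched in Python as follows. The function names `count` and `transition_estimate` are hypothetical and the counting helper is repeated so that the block is self-contained.

```python
from typing import Sequence, Tuple


def count(sample: Sequence[int], word: Tuple[int, ...]) -> int:
    """Number of occurrences of `word` in `sample` (overlaps included)."""
    j = len(word)
    return sum(1 for t in range(len(sample) - j + 1) if tuple(sample[t:t + j]) == word)


def transition_estimate(sample: Sequence[int], context: Tuple[int, ...],
                        a: int, alphabet: Sequence[int]) -> float:
    """Empirical probability of symbol `a` after `context` (time order, oldest first)."""
    denom = sum(count(sample, context + (b,)) for b in alphabet)
    if denom == 0:
        return 1.0 / len(alphabet)  # context never followed by a symbol: uniform estimate
    return count(sample, context + (a,)) / denom
```

With the sample `[0, 1, 0, 1, 1, 0, 1]`, the symbol 1 is followed twice by 0 and once by 1, so the estimate of the probability of 0 after the context `(1,)` is 2/3.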

For $i\geq 1$, define

\Lambda _{n}(x_{-i}^{-1})\,=\,2\,\sum _{y\in A}\sum _{a\in A}N_{n}(yx_{-i}^{-1}a)\log \left[{\frac {{\hat {p}}_{n}(a\mid yx_{-i}^{-1})}{{\hat {p}}_{n}(a\mid x_{-i}^{-1})}}\right]

where $yx_{-i}^{-1}=(y,x_{-i},\ldots ,x_{-1})$ and

{\hat {p}}_{n}(a\mid yx_{-i}^{-1})={\frac {N_{n}(yx_{-i}^{-1}a)}{\sum _{b\in A}N_{n}(yx_{-i}^{-1}b)}}.

Note that $\Lambda _{n}(x_{-i}^{-1})$ is the log-likelihood ratio statistic for testing the consistency of the sample with the probabilistic context tree $(\tau ,p)$ against the alternative that it is consistent with $(\tau ',p')$, where $\tau$ and $\tau '$ differ only by a set of sibling nodes.

The length of the current estimated context is defined by

{\hat {\ell }}_{n}(X_{0}^{n-1})=\max \left\{i=1,\ldots ,K(n):\Lambda _{n}(X_{n-i}^{n-1})\,>\,C\log n\right\}\,

where $C$ is any positive constant. Finally, Rissanen^[1] proved the following consistency result: given a sample $X_{0},\ldots ,X_{n-1}$ generated by a finite probabilistic context tree $(\tau ,p)$, then

P\left({\hat {\ell }}_{n}(X_{0}^{n-1})\neq \ell (X_{0}^{n-1})\right)\longrightarrow 0,

as $n\rightarrow \infty$.
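
The statistic $\Lambda _{n}$ and the estimated context length can be computed with the following Python sketch, which follows the formulas of this section literally. Several choices are assumptions not fixed by the text: the function names, the use of two separate constants `c_depth` (for $K(n)$) and `c_threshold` (for the test), rounding $K(n)$ down to an integer, skipping terms with zero counts, and returning 0 when no length passes the test.

```python
import math
from typing import Sequence, Tuple

Word = Tuple[int, ...]


def count(sample: Sequence[int], word: Word) -> int:
    """Number of occurrences of `word` in `sample` (overlaps included)."""
    j = len(word)
    return sum(1 for t in range(len(sample) - j + 1) if tuple(sample[t:t + j]) == word)


def lambda_stat(sample: Sequence[int], x: Word, alphabet: Sequence[int]) -> float:
    """Log-likelihood ratio statistic comparing the context x with its extensions yx."""
    n_x = sum(count(sample, x + (b,)) for b in alphabet)
    total = 0.0
    for y in alphabet:
        yx = (y,) + x
        n_yx = sum(count(sample, yx + (b,)) for b in alphabet)
        for a in alphabet:
            n_yxa = count(sample, yx + (a,))
            if n_yxa == 0:
                continue  # convention 0 * log(.) = 0
            p_long = n_yxa / n_yx                   # estimate given the longer context yx
            p_short = count(sample, x + (a,)) / n_x  # estimate given the shorter context x
            total += n_yxa * math.log(p_long / p_short)
    return 2.0 * total


def estimated_context_length(sample: Sequence[int], alphabet: Sequence[int],
                             c_depth: float = 1.0, c_threshold: float = 1.0) -> int:
    """Largest i <= K(n) whose statistic exceeds the threshold (0 if there is none)."""
    n = len(sample)  # the sample is assumed to be non-empty
    max_depth = min(int(c_depth * math.log(n)), n - 1)
    threshold = c_threshold * math.log(n)
    best = 0
    for i in range(1, max_depth + 1):
        x = tuple(sample[n - i:])  # the last i symbols of the sample
        if lambda_stat(sample, x, alphabet) > threshold:
            best = i
    return best
```

As a usage example, for the periodic sample `[0, 0, 1] * 100 + [0]` the symbol following a 0 is determined by the symbol preceding that 0, so the statistic for the context `(0,)` is large, while for longer suffixes of the sample it vanishes.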

Bayesian information criterion (BIC)

The estimator of the context tree by BIC with a penalty constant $c>0$ is defined as

{\hat {\tau }}_{\mathrm {BIC} }={\underset {\tau \in {\mathcal {T}}_{n}}{\arg \max }}\{\log L_{\tau }(X_{1}^{n})-c\,{\textrm {d}}f(\tau )\log n\}
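
A minimal Python sketch of this criterion is given below. Since $L_{\tau }$, ${\textrm {d}}f(\tau )$ and ${\mathcal {T}}_{n}$ are not spelled out above, the sketch makes the following assumptions: the likelihood is the maximized likelihood computed from empirical transition counts, the number of degrees of freedom is $(|A|-1)|\tau |$, and the maximization runs over a short, explicitly given list of candidate trees rather than over all trees in ${\mathcal {T}}_{n}$. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Word = Tuple[int, ...]
Tree = Tuple[Word, ...]  # a tree is represented by its contexts, in time order


def count(sample: Sequence[int], word: Word) -> int:
    """Number of occurrences of `word` in `sample` (overlaps included)."""
    j = len(word)
    return sum(1 for t in range(len(sample) - j + 1) if tuple(sample[t:t + j]) == word)


def log_likelihood(sample: Sequence[int], tree: Tree, alphabet: Sequence[int]) -> float:
    """Maximized log-likelihood: sum over contexts w and symbols a of N(wa) log p_hat(a | w)."""
    total = 0.0
    for w in tree:
        n_w = sum(count(sample, w + (b,)) for b in alphabet)
        for a in alphabet:
            n_wa = count(sample, w + (a,))
            if n_wa > 0:
                total += n_wa * math.log(n_wa / n_w)
    return total


def bic_score(sample: Sequence[int], tree: Tree, alphabet: Sequence[int], c: float = 1.0) -> float:
    """Penalized log-likelihood, assuming df(tree) = (|A| - 1) * number of contexts."""
    df = (len(alphabet) - 1) * len(tree)
    return log_likelihood(sample, tree, alphabet) - c * df * math.log(len(sample))


def bic_select(sample: Sequence[int], candidates: Sequence[Tree],
               alphabet: Sequence[int], c: float = 1.0) -> Tree:
    """Candidate tree with the largest BIC score."""
    return max(candidates, key=lambda tree: bic_score(sample, tree, alphabet, c))
```

For the periodic sample `[0, 0, 1] * 100 + [0]` and the candidates `((),)`, `((0,), (1,))` and `((1,), (0, 0), (1, 0))`, the third tree predicts every transition exactly and is selected despite its larger penalty. The sketch ignores boundary effects: trees with longer contexts use slightly fewer transitions of the sample.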

Smallest maximizer criterion (SMC)

The smallest maximizer criterion^[3] is calculated by selecting the smallest tree τ of a set of champion trees C such that

\lim _{n\to \infty }{\frac {\log L_{\tau }(X_{1}^{n})-\log L_{\hat {\tau }}(X_{1}^{n})}{n}}=0

References

1. Rissanen, J (Sep 1983). "A Universal Data Compression System". IEEE Transactions on Information Theory. 29 (5): 656–664. doi:10.1109/TIT.1983.1056741.
2. Bejerano, G (2001). "Variations on probabilistic suffix trees: statistical modeling and prediction of protein families". Bioinformatics. 17 (1): 23–43. doi:10.1093/bioinformatics/17.1.23. PMID 11222260.
3. Galves A, Galves C, Garcia J, Garcia NL, Leonardi F (2012). "Context tree selection and linguistic rhythm retrieval from written texts". The Annals of Applied Statistics. 6 (1): 186–209. arXiv:0902.3619. doi:10.1214/11-AOAS511.
4. Dubnov S, Assayag G, Lartillot O, Bejerano G (2003). "Using machine-learning methods for musical style modeling". Computer. 36 (10): 73–80. CiteSeerX 10.1.1.628.4614. doi:10.1109/MC.2003.1236474.
5. Galves A, Garivier A, Gassiat E (2012). "Joint estimation of intersecting context tree models". Scandinavian Journal of Statistics. 40 (2): 344–362. arXiv:1102.0673. doi:10.1111/j.1467-9469.2012.00814.x.
6. Galves A, Löcherbach E (2008). "Stochastic chains with memory of variable length". TICSP Series. 38: 117–133. arXiv:0804.2050.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
