Variable elimination

Last updated April 23, 2024

Variable elimination (VE) is a simple and general exact inference algorithm in probabilistic graphical models, such as Bayesian networks and Markov random fields.^[1] It can be used for inference of maximum a posteriori (MAP) state or estimation of conditional or marginal distributions over a subset of variables. The algorithm has exponential time complexity, but could be efficient in practice for low-treewidth graphs, if the proper elimination order is used.

Factors

Enabling a key reduction in algorithmic complexity, a factor $f$ , also known as a potential, of variables $V$ is a relation between each instantiation of $v$ of variables $f$ to a non-negative number, commonly denoted as $f(x)$ .^[2] A factor does not necessarily have a set interpretation. One may perform operations on factors of different representations such as a probability distribution or conditional distribution.^[2] Joint distributions often become too large to handle as the complexity of this operation is exponential. Thus variable elimination becomes more feasible when computing factorized entities.

Basic Operations

Variable Summation

Algorithm 1, called sum-out (SO), or marginalization, eliminates a single variable $v$ from a set $\phi$ of factors,^[3] and returns the resulting set of factors. The algorithm collect-relevant simply returns those factors in $\phi$ involving variable $v$ .

Algorithm 1 sum-out( $v$ , $\phi$ )

\Phi

= collect factors relevant to

v

\Psi

= the product of all factors in

\Phi

\tau =\sum _{v}\Psi

return $(\phi -\Phi )\cup \{\tau \}$

Example

Here we have a joint probability distribution. A variable, $v$ can be summed out between a set of instantiations where the set $V-v$ at minimum must agree over the remaining variables. The value of $v$ is irrelevant when it is the variable to be summed out.^[2]

$V_{1}$	$V_{2}$	$V_{3}$	$V_{4}$	$V_{5}$	$Pr(.)$
true	true	true	false	false	0.80
false	true	true	false	false	0.20

After eliminating $V_{1}$ , its reference is excluded and we are left with a distribution only over the remaining variables and the sum of each instantiation.

$V_{2}$	$V_{3}$	$V_{4}$	$V_{5}$	$Pr(.)$
true	true	false	false	1.0

The resulting distribution which follows the sum-out operation only helps to answer queries that do not mention $V_{1}$ .^[2] Also worthy to note, the summing-out operation is commutative.

Factor Multiplication

Computing a product between multiple factors results in a factor compatible with a single instantiation in each factor.^[2]

Algorithm 2 mult-factors( $v$ , $\phi$ )^[2]

Z

= Union of all variables between product of factors

f_{1}(X_{1}),...,f_{m}(X_{m})

f

= a factor over

f

where

f

for all

f

For each instantiation

z

For 1 to

m

x_{1}=

instantiation of variables

X_{1}

consistent with

z

f(z)=f(z)f_{i}(x_{i})

return

f

Factor multiplication is not only commutative but also associative.

Inference

The most common query type is in the form $p(X|E=e)$ where $X$ and $E$ are disjoint subsets of $U$ , and $E$ is observed taking value $e$ . A basic algorithm to computing p(X|E = e) is called variable elimination (VE), first put forth in.^[1]

Taken from,^[1] this algorithm computes $p(X|E=e)$ from a discrete Bayesian network B. VE calls SO to eliminate variables one by one. More specifically, in Algorithm 2, $\phi$ is the set C of conditional probability tables (henceforth "CPTs") for B, $X$ is a list of query variables, $E$ is a list of observed variables, $e$ is the corresponding list of observed values, and $\sigma$ is an elimination ordering for variables $U-XE$ , where $XE$ denotes $X\cup E$ .

Variable Elimination Algorithm VE( $\phi ,X,E,e,\sigma$ )

Multiply factors with appropriate CPTs while σ is not empty

Remove the first variable

v

from

\sigma

\phi

= sum-out

(v,\phi )

p(X,E=e)

= the product of all factors

\Psi \in \phi

return $p(X,E=e)/\sum _{X}p(X,E=e)$

Ordering

Finding the optimal order in which to eliminate variables is an NP-hard problem. As such there are heuristics one may follow to better optimize performance by order:

Minimum Degree: Eliminate the variable which results in constructing the smallest factor possible.^[2]
Minimum Fill: By constructing an undirected graph showing variable relations expressed by all CPTs, eliminate the variable which would result in the least edges to be added post elimination.^[2]

Related Research Articles

Bra–ket notation, also called Dirac notation, is a notation for linear algebra and linear operators on complex vector spaces together with their dual space both in the finite-dimensional and infinite-dimensional case. It is specifically designed to ease the types of calculations that frequently come up in quantum mechanics. Its use in quantum mechanics is quite widespread.

In theoretical physics, a Feynman diagram is a pictorial representation of the mathematical expressions describing the behavior and interaction of subatomic particles. The scheme is named after American physicist Richard Feynman, who introduced the diagrams in 1948. The interaction of subatomic particles can be complex and difficult to understand; Feynman diagrams give a simple visualization of what would otherwise be an arcane and abstract formula. According to David Kaiser, "Since the middle of the 20th century, theoretical physicists have increasingly turned to this tool to help them undertake critical calculations. Feynman diagrams have revolutionized nearly every aspect of theoretical physics." While the diagrams are applied primarily to quantum field theory, they can also be used in other areas of physics, such as solid-state theory. Frank Wilczek wrote that the calculations that won him the 2004 Nobel Prize in Physics "would have been literally unthinkable without Feynman diagrams, as would [Wilczek's] calculations that established a route to production and observation of the Higgs particle."

The proof of Gödel's completeness theorem given by Kurt Gödel in his doctoral dissertation of 1929 is not easy to read today; it uses concepts and formalisms that are no longer used and terminology that is often obscure. The version given below attempts to represent all the steps in the proof and all the important ideas faithfully, while restating the proof in the modern language of mathematical logic. This outline should not be considered a rigorous proof of the theorem.

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

Shor's algorithm is a quantum algorithm for finding the prime factors of an integer. It was developed in 1994 by the American mathematician Peter Shor. It is one of the few known quantum algorithms with compelling potential applications and strong evidence of superpolynomial speedup compared to best known classical algorithms. On the other hand, factoring numbers of practical significance requires far more qubits than available in the near future. Another concern is that noise in quantum circuits may undermine results, requiring additional qubits for quantum error correction.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

In estimation theory and statistics, the Cramér–Rao bound (CRB) relates to estimation of a deterministic parameter. The result is named in honor of Harald Cramér and C. R. Rao, but has also been derived independently by Maurice Fréchet, Georges Darmois, and by Alexander Aitken and Harold Silverstone. It is also known as Fréchet-Cramér–Rao or Fréchet-Darmois-Cramér-Rao lower bound. It states that the precision of any unbiased estimator is at most the Fisher information; or (equivalently) the reciprocal of the Fisher information is a lower bound on its variance.

The Gram–Charlier A series, and the Edgeworth series are series that approximate a probability distribution in terms of its cumulants. The series are the same; but, the arrangement of terms differ. The key idea of these expansions is to write the characteristic function of the distribution whose probability density function $f$ is to be approximated in terms of the characteristic function of a distribution with known and suitable properties, and to recover $f$ through the inverse Fourier transform.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture models are used for clustering, under the name model-based clustering, and also for density estimation.

In mathematics, particularly in functional analysis, a projection-valued measure is a function defined on certain subsets of a fixed set and whose values are self-adjoint projections on a fixed Hilbert space. A projection-valued measure (PVM) is formally similar to a real-valued measure, except that its values are self-adjoint projections rather than real numbers. As in the case of ordinary measures, it is possible to integrate complex-valued functions with respect to a PVM; the result of such an integration is a linear operator on the given Hilbert space.

<span class="mw-page-title-main">Dirichlet distribution</span> Probability distribution

In probability and statistics, the Dirichlet distribution (after Peter Gustav Lejeune Dirichlet), often denoted $, is a family of continuous multivariate probability distributions parameterized by a vector of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD) . Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact, the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.$

In complexity theory, the Karp–Lipton theorem states that if the Boolean satisfiability problem (SAT) can be solved by Boolean circuits with a polynomial number of logic gates, then

The following are important identities involving derivatives and integrals in vector calculus.

The folded normal distribution is a probability distribution related to the normal distribution. Given a normally distributed random variable X with mean μ and variance σ², the random variable Y = |X| has a folded normal distribution. Such a case may be encountered if only the magnitude of some variable is recorded, but not its sign. The distribution is called "folded" because probability mass to the left of x = 0 is folded over by taking the absolute value. In the physics of heat conduction, the folded normal distribution is a fundamental solution of the heat equation on the half space; it corresponds to having a perfect insulator on a hyperplane through the origin.

In statistics, the inverse Wishart distribution, also called the inverted Wishart distribution, is a probability distribution defined on real-valued positive-definite matrices. In Bayesian statistics it is used as the conjugate prior for the covariance matrix of a multivariate normal distribution.

In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution. It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector $, and an observation drawn from a multinomial distribution with probability vector p and number of trials n . The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data is collected. The compounding corresponds to a Pólya urn scheme. It is frequently encountered in Bayesian statistics, machine learning, empirical Bayes methods and classical statistics as an overdispersed multinomial distribution.$

<span class="mw-page-title-main">Truncated normal distribution</span> Type of probability distribution

In probability and statistics, the truncated normal distribution is the probability distribution derived from that of a normally distributed random variable by bounding the random variable from either below or above. The truncated normal distribution has wide applications in statistics and econometrics.

In mathematical analysis, and especially in real, harmonic analysis and functional analysis, an Orlicz space is a type of function space which generalizes the L^p spaces. Like the L^p spaces, they are Banach spaces. The spaces are named for Władysław Orlicz, who was the first to define them in 1932.

In probability theory and statistics, the generalized chi-squared distribution is the distribution of a quadratic form of a multinormal variable, or a linear combination of different normal variables and squares of normal variables. Equivalently, it is also a linear sum of independent noncentral chi-square variables and a normal variable. There are several other such generalizations for which the same term is sometimes used; some of them are special cases of the family discussed here, for example the gamma distribution.

Lagrangian field theory is a formalism in classical field theory. It is the field-theoretic analogue of Lagrangian mechanics. Lagrangian mechanics is used to analyze the motion of a system of discrete particles each with a finite number of degrees of freedom. Lagrangian field theory applies to continua and fields, which have an infinite number of degrees of freedom.

References

1 2 3 Zhang, N.L., Poole, D.:A Simple Approach to Bayesian Network Computations.In: 7th Canadian Conference on Artificial Intelligence, pp. 171--178. Springer, New York (1994)
1 2 3 4 5 6 7 8 Darwiche, Adnan (2009-01-01). Modeling and Reasoning with Bayesian Networks. doi:10.1017/cbo9780511811357. ISBN 9780511811357.
↑ Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA (2009)

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[zhang-1] 1 2 3 Zhang, N.L., Poole, D.:A Simple Approach to Bayesian Network Computations.In: 7th Canadian Conference on Artificial Intelligence, pp. 171--178. Springer, New York (1994)

[:0-2] 1 2 3 4 5 6 7 8 Darwiche, Adnan (2009-01-01). Modeling and Reasoning with Bayesian Networks. doi:10.1017/cbo9780511811357. ISBN 9780511811357.

[3] Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA (2009)

[1]

[2]

[3]