Markov random field

Last updated
An example of a Markov random field. Each edge represents dependency. In this example: A depends on B and D. B depends on A and D. D depends on A, B, and E. E depends on D and C. C depends on E. Markov random field example.png
An example of a Markov random field. Each edge represents dependency. In this example: A depends on B and D. B depends on A and D. D depends on A, B, and E. E depends on D and C. C depends on E.

In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. In other words, a random field is said to be a Markov random field if it satisfies Markov properties. The concept originates from the Sherrington–Kirkpatrick model. [1]

Contents

A Markov network or MRF is similar to a Bayesian network in its representation of dependencies; the differences being that Bayesian networks are directed and acyclic, whereas Markov networks are undirected and may be cyclic. Thus, a Markov network can represent certain dependencies that a Bayesian network cannot (such as cyclic dependencies [ further explanation needed ]); on the other hand, it can't represent certain dependencies that a Bayesian network can (such as induced dependencies [ further explanation needed ]). The underlying graph of a Markov random field may be finite or infinite.

When the joint probability density of the random variables is strictly positive, it is also referred to as a Gibbs random field, because, according to the Hammersley–Clifford theorem, it can then be represented by a Gibbs measure for an appropriate (locally defined) energy function. The prototypical Markov random field is the Ising model; indeed, the Markov random field was introduced as the general setting for the Ising model. [2] In the domain of artificial intelligence, a Markov random field is used to model various low- to mid-level tasks in image processing and computer vision. [3]

Definition

Given an undirected graph , a set of random variables indexed by   form a Markov random field with respect to   if they satisfy the local Markov properties:

Pairwise Markov property: Any two non-adjacent variables are conditionally independent given all other variables:
Local Markov property: A variable is conditionally independent of all other variables given its neighbors:
where is the set of neighbors of , and is the closed neighbourhood of .
Global Markov property: Any two subsets of variables are conditionally independent given a separating subset:
where every path from a node in to a node in passes through .

The Global Markov property is stronger than the Local Markov property, which in turn is stronger than the Pairwise one. [4] However, the above three Markov properties are equivalent for positive distributions [5] (those that assign only nonzero probabilities to the associated variables).

The relation between the three Markov properties is particularly clear in the following formulation:

Clique factorization

As the Markov property of an arbitrary probability distribution can be difficult to establish, a commonly used class of Markov random fields are those that can be factorized according to the cliques of the graph.

Given a set of random variables , let be the probability of a particular field configuration in —that is, is the probability of finding that the random variables take on the particular value . Because is a set, the probability of should be understood to be taken with respect to a joint distribution of the .

If this joint density can be factorized over the cliques of as

then forms a Markov random field with respect to . Here, is the set of cliques of . The definition is equivalent if only maximal cliques are used. The functions are sometimes referred to as factor potentials or clique potentials. Note, however, conflicting terminology is in use: the word potential is often applied to the logarithm of . This is because, in statistical mechanics, has a direct interpretation as the potential energy of a configuration  .

Some MRF's do not factorize: a simple example can be constructed on a cycle of 4 nodes with some infinite energies, i.e. configurations of zero probabilities, [6] even if one, more appropriately, allows the infinite energies to act on the complete graph on . [7]

MRF's factorize if at least one of the following conditions is fulfilled:

When such a factorization does exist, it is possible to construct a factor graph for the network.

Exponential family

Any positive Markov random field can be written as exponential family in canonical form with feature functions such that the full-joint distribution can be written as

where the notation

is simply a dot product over field configurations, and Z is the partition function:

Here, denotes the set of all possible assignments of values to all the network's random variables. Usually, the feature functions are defined such that they are indicators of the clique's configuration, i.e. if corresponds to the i-th possible configuration of the k-th clique and 0 otherwise. This model is equivalent to the clique factorization model given above, if is the cardinality of the clique, and the weight of a feature corresponds to the logarithm of the corresponding clique factor, i.e., where is the i-th possible configuration of the k-th clique, i.e. the i-th value in the domain of the clique .

The probability P is often called the Gibbs measure. This expression of a Markov field as a logistic model is only possible if all clique factors are non-zero, i.e. if none of the elements of are assigned a probability of 0. This allows techniques from matrix algebra to be applied, e.g. that the trace of a matrix is log of the determinant, with the matrix representation of a graph arising from the graph's incidence matrix.

The importance of the partition function Z is that many concepts from statistical mechanics, such as entropy, directly generalize to the case of Markov networks, and an intuitive understanding can thereby be gained. In addition, the partition function allows variational methods to be applied to the solution of the problem: one can attach a driving force to one or more of the random variables, and explore the reaction of the network in response to this perturbation. Thus, for example, one may add a driving term Jv, for each vertex v of the graph, to the partition function to get:

Formally differentiating with respect to Jv gives the expectation value of the random variable Xv associated with the vertex v:

Correlation functions are computed likewise; the two-point correlation is:

Unfortunately, though the likelihood of a logistic Markov network is convex, evaluating the likelihood or gradient of the likelihood of a model requires inference in the model, which is generally computationally infeasible (see 'Inference' below).

Examples

Gaussian

A multivariate normal distribution forms a Markov random field with respect to a graph if the missing edges correspond to zeros on the precision matrix (the inverse covariance matrix):

such that

[8]

Inference

As in a Bayesian network, one may calculate the conditional distribution of a set of nodes given values to another set of nodes in the Markov random field by summing over all possible assignments to ; this is called exact inference. However, exact inference is a #P-complete problem, and thus computationally intractable in the general case. Approximation techniques such as Markov chain Monte Carlo and loopy belief propagation are often more feasible in practice. Some particular subclasses of MRFs, such as trees (see Chow–Liu tree), have polynomial-time inference algorithms; discovering such subclasses is an active research topic. There are also subclasses of MRFs that permit efficient MAP, or most likely assignment, inference; examples of these include associative networks. [9] [10] Another interesting sub-class is the one of decomposable models (when the graph is chordal): having a closed-form for the MLE, it is possible to discover a consistent structure for hundreds of variables. [11]

Conditional random fields

One notable variant of a Markov random field is a conditional random field , in which each random variable may also be conditioned upon a set of global observations . In this model, each function is a mapping from all assignments to both the clique k and the observations to the nonnegative real numbers. This form of the Markov network may be more appropriate for producing discriminative classifiers, which do not model the distribution over the observations. CRFs were proposed by John D. Lafferty, Andrew McCallum and Fernando C.N. Pereira in 2001. [12]

Varied applications

Markov random fields find application in a variety of fields, ranging from computer graphics to computer vision, machine learning or computational biology, [13] [14] and information retrieval. [15] MRFs are used in image processing to generate textures as they can be used to generate flexible and stochastic image models. In image modelling, the task is to find a suitable intensity distribution of a given image, where suitability depends on the kind of task and MRFs are flexible enough to be used for image and texture synthesis, image compression and restoration, image segmentation, 3D image inference from 2D images, image registration, texture synthesis, super-resolution, stereo matching and information retrieval. They can be used to solve various computer vision problems which can be posed as energy minimization problems or problems where different regions have to be distinguished using a set of discriminating features, within a Markov random field framework, to predict the category of the region. [16] Markov random fields were a generalization over the Ising model and have, since then, been used widely in combinatorial optimizations and networks.

See also

Related Research Articles

In probability theory, the central limit theorem (CLT) states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution. This holds even if the original variables themselves are not normally distributed. There are several versions of the CLT, each applying in the context of different conditions.

<span class="mw-page-title-main">Multivariate random variable</span> Random variable with multiple component dimensions

In probability, and statistics, a multivariate random variable or random vector is a list or vector of mathematical variables each of whose value is unknown, either because the value has not yet occurred or because there is imperfect knowledge of its value. The individual variables in a random vector are grouped together because they are all part of a single mathematical system — often they represent different properties of an individual statistical unit. For example, while a given person has a specific age, height and weight, the representation of these features of an unspecified person from within a group would be a random vector. Normally each element of a random vector is a real number.

A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent Markov process. An HMM requires that there be an observable process whose outcomes depend on the outcomes of in a known way. Since cannot be observed directly, the goal is to learn about state of by observing By definition of being a Markov model, an HMM has an additional requirement that the outcome of at time must be "influenced" exclusively by the outcome of at and that the outcomes of and at must be conditionally independent of at given at time Estimation of the parameters in an HMM can be performed using maximum likelihood. For linear chain HMMs, the Baum–Welch algorithm can be used to estimate the parameters.

In probability theory and statistics, two real-valued random variables, , , are said to be uncorrelated if their covariance, , is zero. If two variables are uncorrelated, there is no linear relationship between them.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability theory, statistics—particularly Bayesian statistics—and machine learning.

<span class="mw-page-title-main">Kriging</span> Method of interpolation

In statistics, originally in geostatistics, kriging or Kriging, also known as Gaussian process regression, is a method of interpolation based on Gaussian process governed by prior covariances. Under suitable assumptions of the prior, kriging gives the best linear unbiased prediction (BLUP) at unsampled locations. Interpolating methods based on other criteria such as smoothness may not yield the BLUP. The method is widely used in the domain of spatial analysis and computer experiments. The technique is also known as Wiener–Kolmogorov prediction, after Norbert Wiener and Andrey Kolmogorov.

In physics and mathematics, a random field is a random function over an arbitrary domain. That is, it is a function that takes on a random value at each point (or some other domain). It is also sometimes thought of as a synonym for a stochastic process with some restriction on its index set. That is, by modern definitions, a random field is a generalization of a stochastic process where the underlying parameter need no longer be real or integer valued "time" but can instead take values that are multidimensional vectors or points on some manifold.

In probability and statistics, a mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized. The underlying random variables may be random real numbers, or they may be random vectors, in which case the mixture distribution is a multivariate distribution.

Belief propagation, also known as sum–product message passing, is a message-passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. Belief propagation is commonly used in artificial intelligence and information theory, and has demonstrated empirical success in numerous applications, including low-density parity-check codes, turbo codes, free energy approximation, and satisfiability.

<span class="mw-page-title-main">Conditional independence</span> Probability theory concept

In probability theory, conditional independence describes situations wherein an observation is irrelevant or redundant when evaluating the certainty of a hypothesis. Conditional independence is usually formulated in terms of conditional probability, as a special case where the probability of the hypothesis given the uninformative observation is equal to the probability without. If is the hypothesis, and and are observations, conditional independence can be stated as an equality:

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term (endogenous), in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable and is not correlated with the error term, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.

A Markov logic network (MLN) is a probabilistic logic which applies the ideas of a Markov network to first-order logic, defining probability distributions on possible worlds on any given domain.

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.

The partition function or configuration integral, as used in probability theory, information theory and dynamical systems, is a generalization of the definition of a partition function in statistical mechanics. It is a special case of a normalizing constant in probability theory, for the Boltzmann distribution. The partition function occurs in many problems of probability theory because, in situations where there is a natural symmetry, its associated probability measure, the Gibbs measure, has the Markov property. This means that the partition function occurs not only in physical systems with translation symmetry, but also in such varied settings as neural networks, and applications such as genomics, corpus linguistics and artificial intelligence, which employ Markov networks, and Markov logic networks. The Gibbs measure is also the unique measure that has the property of maximizing the entropy for a fixed expectation value of the energy; this underlies the appearance of the partition function in maximum entropy methods and the algorithms derived therefrom.

The Hammersley–Clifford theorem is a result in probability theory, mathematical statistics and statistical mechanics that gives necessary and sufficient conditions under which a strictly positive probability distribution can be represented as events generated by a Markov network. It is the fundamental theorem of random fields. It states that a probability distribution that has a strictly positive mass or density satisfies one of the Markov properties with respect to an undirected graph G if and only if it is a Gibbs random field, that is, its density can be factorized over the cliques of the graph.

Graphical models have become powerful frameworks for protein structure prediction, protein–protein interaction, and free energy calculations for protein structures. Using a graphical model to represent the protein structure allows the solution of many problems including secondary structure prediction, protein-protein interactions, protein-drug interaction, and free energy calculations.

<span class="mw-page-title-main">Exponential family random graph models</span>

Exponential Random Graph Models (ERGMs) are a family of statistical models for analyzing data from social and other networks. Examples of networks examined using ERGM include knowledge networks, organizational networks, colleague networks, social media networks, networks of scientific development, and others.

In probability theory, concentration inequalities provide mathematical bounds on the probability of a random variable deviating from some value.

Dependency networks (DNs) are graphical models, similar to Markov networks, wherein each vertex (node) corresponds to a random variable and each edge captures dependencies among variables. Unlike Bayesian networks, DNs may contain cycles. Each node is associated to a conditional probability table, which determines the realization of the random variable given its parents.

References

  1. Sherrington, David; Kirkpatrick, Scott (1975), "Solvable Model of a Spin-Glass", Physical Review Letters, 35 (35): 1792–1796, Bibcode:1975PhRvL..35.1792S, doi:10.1103/PhysRevLett.35.1792
  2. Kindermann, Ross; Snell, J. Laurie (1980). Markov Random Fields and Their Applications (PDF). American Mathematical Society. ISBN   978-0-8218-5001-5. MR   0620955. Archived from the original (PDF) on 2017-08-10. Retrieved 2012-04-09.
  3. Li, S. Z. (2009). Markov Random Field Modeling in Image Analysis. Springer. ISBN   9781848002791.
  4. Lauritzen, Steffen (1996). Graphical models. Oxford: Clarendon Press. p. 33. ISBN   978-0198522195.
  5. Koller, Daphne; Friedman, Nir (2009). Probabilistic Graphical Models. MIT Press. p. 114-122. ISBN   9780262013192.
  6. Moussouris, John (1974). "Gibbs and Markov random systems with constraints". Journal of Statistical Physics. 10 (1): 11–33. Bibcode:1974JSP....10...11M. doi:10.1007/BF01011714. hdl: 10338.dmlcz/135184 . MR   0432132. S2CID   121299906.
  7. Gandolfi, Alberto; Lenarda, Pietro (2016). "A note on Gibbs and Markov Random Fields with constraints and their moments". Mathematics and Mechanics of Complex Systems. 4 (3–4): 407–422. doi: 10.2140/memocs.2016.4.407 .
  8. Rue, Håvard; Held, Leonhard (2005). Gaussian Markov random fields: theory and applications. CRC Press. ISBN   978-1-58488-432-3.
  9. Taskar, Benjamin; Chatalbashev, Vassil; Koller, Daphne (2004), "Learning associative Markov networks", in Brodley, Carla E. (ed.), Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, ACM International Conference Proceeding Series, vol. 69, Association for Computing Machinery, p. 102, CiteSeerX   10.1.1.157.329 , doi:10.1145/1015330.1015444, ISBN   978-1581138283, S2CID   11312524 .
  10. Duchi, John C.; Tarlow, Daniel; Elidan, Gal; Koller, Daphne (2006), "Using Combinatorial Optimization within Max-Product Belief Propagation", in Schölkopf, Bernhard; Platt, John C.; Hoffman, Thomas (eds.), Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, Advances in Neural Information Processing Systems, vol. 19, MIT Press, pp. 369–376.
  11. Petitjean, F.; Webb, G.I.; Nicholson, A.E. (2013). Scaling log-linear analysis to high-dimensional data (PDF). International Conference on Data Mining. Dallas, TX, USA: IEEE.
  12. "Two classic paper prizes for papers that appeared at ICML 2013". ICML. 2013. Retrieved 15 December 2014.
  13. Kindermann & Snell, Ross & Laurie (1980). Markov Random Fields and their Applications. Rhode Island: American Mathematical Society. ISBN   978-0-8218-5001-5.
  14. Banf, Michael; Rhee, Seung Y. (2017-02-01). "Enhancing gene regulatory network inference through data integration with markov random fields". Scientific Reports. 7 (1): 41174. Bibcode:2017NatSR...741174B. doi:10.1038/srep41174. ISSN   2045-2322. PMC   5286517 . PMID   28145456.
  15. Metzler, Donald; Croft, W.Bruce (2005). A Markov random field model for term dependencies. Proceedings of the 28th ACM SIGIR Conference. Salvador, Brazil: ACM. pp. 472–479. doi:10.1145/1076034.1076115.
  16. Zhang & Zakhor, Richard & Avideh (2014). "Automatic Identification of Window Regions on Indoor Point Clouds Using LiDAR and Cameras". VIP Lab Publications. CiteSeerX   10.1.1.649.303 .