Bayesian programming is a formalism and a methodology for specifying probabilistic models and solving problems when less than the necessary information is available.
Edwin T. Jaynes proposed that probability could be considered as an alternative and an extension of logic for rational reasoning with incomplete and uncertain information. In his founding book Probability Theory: The Logic of Science [1] he developed this theory and proposed what he called “the robot,” which was not a physical device, but an inference engine to automate probabilistic reasoning—a kind of Prolog for probability instead of logic. Bayesian programming [2] is a formal and concrete implementation of this "robot".
Bayesian programming may also be seen as an algebraic formalism to specify graphical models such as, for instance, Bayesian networks, dynamic Bayesian networks, Kalman filters or hidden Markov models. Indeed, Bayesian Programming is more general than Bayesian networks and has a power of expression equivalent to probabilistic factor graphs. [3]
A Bayesian program is a means of specifying a family of probability distributions.
The constituent elements of a Bayesian program are presented below: [4] a program is constructed from a description and a question. A description is constructed from a specification ($\pi$), as given by the programmer, and an identification (or learning) process for the parameters not completely specified by the specification, using a data set ($\delta$). A specification is itself constructed from a set of pertinent variables, a decomposition of the joint distribution and a set of forms.
The purpose of a description is to specify an effective method of computing a joint probability distribution on a set of variables $\{X_1, X_2, \ldots, X_N\}$ given a set of experimental data $\delta$ and some specification $\pi$. This joint distribution is denoted as $P(X_1 \wedge X_2 \wedge \cdots \wedge X_N \mid \delta \wedge \pi)$. [5]
To specify preliminary knowledge $\pi$, the programmer must undertake the following: define the set of relevant variables $\{X_1, X_2, \ldots, X_N\}$ on which the joint distribution is defined, decompose the joint distribution, and define the forms of each of the distributions appearing in the decomposition.
Given a partition of $\{X_1, X_2, \ldots, X_N\}$ containing $K$ subsets, $K$ variables $L_1, \ldots, L_K$ are defined, each corresponding to one of these subsets. Each variable $L_k$ is obtained as the conjunction of the variables belonging to the $k$-th subset. Recursive application of Bayes' theorem leads to:

$$P(X_1 \wedge X_2 \wedge \cdots \wedge X_N \mid \delta \wedge \pi) = P(L_1 \mid \delta \wedge \pi) \times P(L_2 \mid L_1 \wedge \delta \wedge \pi) \times \cdots \times P(L_K \mid L_{K-1} \wedge \cdots \wedge L_1 \wedge \delta \wedge \pi)$$

Conditional independence hypotheses then allow further simplifications. A conditional independence hypothesis for variable $L_k$ is defined by choosing some variables among the variables appearing in the conjunction $L_{k-1} \wedge \cdots \wedge L_1$, labelling $R_k$ as the conjunction of these chosen variables and setting:

$$P(L_k \mid L_{k-1} \wedge \cdots \wedge L_1 \wedge \delta \wedge \pi) = P(L_k \mid R_k \wedge \delta \wedge \pi)$$

We then obtain:

$$P(X_1 \wedge X_2 \wedge \cdots \wedge X_N \mid \delta \wedge \pi) = P(L_1 \mid \delta \wedge \pi) \times P(L_2 \mid R_2 \wedge \delta \wedge \pi) \times \cdots \times P(L_K \mid R_K \wedge \delta \wedge \pi)$$
Such a simplification of the joint distribution as a product of simpler distributions is called a decomposition, derived using the chain rule.
This ensures that each variable appears at most once on the left of a conditioning bar, which is the necessary and sufficient condition to write mathematically valid decompositions.[citation needed]
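As a purely illustrative example (the variables below are hypothetical and not taken from the text above), a partition of four variables into $L_1 = X_1$, $L_2 = X_2 \wedge X_3$ and $L_3 = X_4$, together with the conditional independence hypothesis $R_3 = X_3$, would yield the decomposition:

$$P(X_1 \wedge X_2 \wedge X_3 \wedge X_4 \mid \delta \wedge \pi) = P(X_1 \mid \delta \wedge \pi) \times P(X_2 \wedge X_3 \mid X_1 \wedge \delta \wedge \pi) \times P(X_4 \mid X_3 \wedge \delta \wedge \pi)$$

in which each variable indeed appears at most once on the left of a conditioning bar.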
Each distribution $P(L_k \mid R_k \wedge \delta \wedge \pi)$ appearing in the product is then associated with either a parametric form (i.e., a function $f_\mu(L_k)$) or a question to another Bayesian program.
When it is a form $f_\mu(L_k)$, in general, $\mu$ is a vector of parameters that may depend on $R_k$ or $\delta$ or both. Learning takes place when some of these parameters are computed using the data set $\delta$.
An important feature of Bayesian programming is this capacity to use questions to other Bayesian programs as components of the definition of a new Bayesian program. In that case, $P(L_k \mid R_k \wedge \delta \wedge \pi)$ is obtained by inferences performed by another Bayesian program, defined by its own specification and data. This is similar to calling a subroutine in classical programming and provides an easy way to build hierarchical models.
Given a description (i.e., $P(X_1 \wedge X_2 \wedge \cdots \wedge X_N \mid \delta \wedge \pi)$), a question is obtained by partitioning $\{X_1, X_2, \ldots, X_N\}$ into three sets: the searched variables, the known variables and the free variables.
The three variables $\mathit{Searched}$, $\mathit{Known}$ and $\mathit{Free}$ are defined as the conjunctions of the variables belonging to these sets.
A question is defined as the set of distributions:

$$P(\mathit{Searched} \mid \mathit{Known} \wedge \delta \wedge \pi)$$

made of as many "instantiated questions" as the cardinality of $\mathit{Known}$ (i.e., the number of possible values of the known variables), each instantiated question being the distribution:

$$P(\mathit{Searched} \mid \mathit{Known} = \mathit{known} \wedge \delta \wedge \pi)$$
Given the joint distribution $P(X_1 \wedge X_2 \wedge \cdots \wedge X_N \mid \delta \wedge \pi)$, it is always possible to compute any possible question using the following general inference:

$$\begin{aligned} P(\mathit{Searched} \mid \mathit{Known} \wedge \delta \wedge \pi) &= \sum_{\mathit{Free}} P(\mathit{Searched} \wedge \mathit{Free} \mid \mathit{Known} \wedge \delta \wedge \pi) \\ &= \frac{\sum_{\mathit{Free}} P(\mathit{Searched} \wedge \mathit{Free} \wedge \mathit{Known} \mid \delta \wedge \pi)}{P(\mathit{Known} \mid \delta \wedge \pi)} \\ &= \frac{\sum_{\mathit{Free}} P(\mathit{Searched} \wedge \mathit{Free} \wedge \mathit{Known} \mid \delta \wedge \pi)}{\sum_{\mathit{Searched} \wedge \mathit{Free}} P(\mathit{Searched} \wedge \mathit{Free} \wedge \mathit{Known} \mid \delta \wedge \pi)} \end{aligned}$$

where the first equality results from the marginalization rule, the second results from Bayes' theorem and the third corresponds to a second application of marginalization. The denominator is a normalization term and can be replaced by a constant $Z$.
Theoretically, this allows any Bayesian inference problem to be solved. In practice, however, the cost of computing $P(\mathit{Searched} \mid \mathit{Known} \wedge \delta \wedge \pi)$ exhaustively and exactly is too great in almost all cases.
Replacing the joint distribution by its decomposition, we get:

$$P(\mathit{Searched} \mid \mathit{Known} \wedge \delta \wedge \pi) = \frac{1}{Z} \sum_{\mathit{Free}} \prod_{k=1}^{K} P(L_k \mid R_k \wedge \delta \wedge \pi)$$
which is usually a much simpler expression to compute, as the dimensionality of the problem is considerably reduced by the decomposition into a product of lower-dimensional distributions.
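To make the mechanics concrete, here is a minimal Python sketch (with a hypothetical three-variable joint table; names such as `infer` are illustrative, not part of any standard library) that answers a question by summing over the free variables and normalizing, as in the general inference above:

```python
from itertools import product
from collections import defaultdict

# Hypothetical toy joint distribution P(A, B, C) over three binary variables,
# given as an explicit table {(a, b, c): probability}.
joint = {
    assignment: p
    for assignment, p in zip(
        product([0, 1], repeat=3),
        [0.15, 0.05, 0.10, 0.20, 0.05, 0.15, 0.10, 0.20],
    )
}

def infer(joint, searched_idx, known):
    """Compute P(Searched | Known) by marginalizing over the free variables.

    searched_idx: index of the searched variable in the joint's key tuple.
    known: dict {index: value} for the known variables; the remaining
    variables are free and are summed out.
    """
    scores = defaultdict(float)
    for assignment, p in joint.items():
        if all(assignment[i] == v for i, v in known.items()):
            scores[assignment[searched_idx]] += p   # sum over Free
    z = sum(scores.values())                        # normalization constant Z
    return {value: s / z for value, s in scores.items()}

# Instantiated question P(A | C = 1): searched variable A (index 0), known C (index 2).
print(infer(joint, searched_idx=0, known={2: 1}))
```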
The purpose of Bayesian spam filtering is to eliminate junk e-mails.
The problem is very easy to formulate. E-mails should be classified into one of two categories: non-spam or spam. The only available information to classify the e-mails is their content: a set of words. Using these words without taking their order into account is commonly called a bag-of-words model.
The classifier should furthermore be able to adapt to its user and to learn from experience. Starting from an initial standard setting, the classifier should modify its internal parameters when the user disagrees with its decision. It will hence adapt to the user's criteria for differentiating between non-spam and spam, and its results will improve as it encounters more and more classified e-mails.
The variables necessary to write this program are as follows: $Spam$, a binary variable that is false if the e-mail is not spam and true otherwise; and $W_0, W_1, \ldots, W_{N-1}$, $N$ binary variables, where $W_n$ is true if the $n$-th word of the dictionary is present in the text.
These binary variables sum up all the information about an e-mail.
Starting from the joint distribution and recursively applying Bayes' theorem, we obtain:

$$P(Spam \wedge W_0 \wedge \cdots \wedge W_{N-1}) = P(Spam) \times P(W_0 \mid Spam) \times P(W_1 \mid W_0 \wedge Spam) \times \cdots \times P(W_{N-1} \mid W_{N-2} \wedge \cdots \wedge W_0 \wedge Spam)$$

This is an exact mathematical expression.
It can be drastically simplified by assuming that the probability of appearance of a word, knowing the nature of the text (spam or not), is independent of the appearance of the other words; this makes the spam filter a naive Bayes model.
For instance, the programmer can assume that:

$$P(W_n \mid W_{n-1} \wedge \cdots \wedge W_0 \wedge Spam) = P(W_n \mid Spam)$$

to finally obtain:

$$P(Spam \wedge W_0 \wedge \cdots \wedge W_{N-1}) = P(Spam) \times \prod_{n=0}^{N-1} P(W_n \mid Spam)$$
This kind of assumption is known as the naive Bayes assumption. It is "naive" in the sense that the independence between words is clearly not completely true. For instance, it completely neglects the fact that the appearance of pairs of words may be more significant than isolated appearances. However, the programmer may adopt this hypothesis and develop the model and the associated inferences to test how reliable and efficient it is.
To be able to compute the joint distribution, the programmer must now specify the distributions appearing in the decomposition: the prior $P(Spam)$ and each of the $N$ forms $P(W_n \mid Spam)$. The latter may, for instance, be specified using Laplace's rule of succession (a pseudocount-based smoothing technique):

$$P(W_n = \mathrm{true} \mid Spam = \mathrm{false}) = \frac{1 + a_f^n}{2 + a_f}, \qquad P(W_n = \mathrm{true} \mid Spam = \mathrm{true}) = \frac{1 + a_t^n}{2 + a_t}$$

where $a_f^n$ stands for the number of appearances of the $n$-th word in non-spam e-mails and $a_f$ stands for the total number of non-spam e-mails. Similarly, $a_t^n$ stands for the number of appearances of the $n$-th word in spam e-mails and $a_t$ stands for the total number of spam e-mails.
The forms are not yet completely specified because the parameters $a_f^n$, $a_f$, $a_t^n$ and $a_t$ have no values yet.
The identification of these parameters could be done either by batch processing a series of classified e-mails or by an incremental updating of the parameters using the user's classifications of the e-mails as they arrive.
Both methods could be combined: the system could start with initial standard values of these parameters drawn from a generic database, and then incremental learning customizes the classifier to each individual user.
The question asked to the program is: "what is the probability for a given text to be spam, knowing which words appear and don't appear in this text?" It can be formalized by:

$$P(Spam \mid w_0 \wedge \cdots \wedge w_{N-1})$$

which can be computed as follows:

$$P(Spam \mid w_0 \wedge \cdots \wedge w_{N-1}) = \frac{P(Spam) \prod_{n=0}^{N-1} P(w_n \mid Spam)}{\sum_{Spam} \left[ P(Spam) \prod_{n=0}^{N-1} P(w_n \mid Spam) \right]}$$
The denominator is a normalization constant. It is not necessary to compute it to decide if we are dealing with spam. For instance, an easy trick is to compute the ratio:

$$\frac{P(Spam = \mathrm{true} \mid w_0 \wedge \cdots \wedge w_{N-1})}{P(Spam = \mathrm{false} \mid w_0 \wedge \cdots \wedge w_{N-1})} = \frac{P(Spam = \mathrm{true})}{P(Spam = \mathrm{false})} \times \prod_{n=0}^{N-1} \frac{P(w_n \mid Spam = \mathrm{true})}{P(w_n \mid Spam = \mathrm{false})}$$

This computation is faster and easier because it requires only products, with no summation for normalization.
The Bayesian spam filter program is thus completely defined by its variables, the naive Bayes decomposition, the parametric forms with their identification (batch or incremental learning of the counts) and the question above.
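A minimal Python sketch of such a filter follows, assuming a tiny hypothetical dictionary and using Laplace's rule of succession for the forms; the class and method names are illustrative only:

```python
class NaiveBayesSpamFilter:
    """Sketch of the naive Bayes spam filter described above."""

    def __init__(self, dictionary):
        self.dictionary = list(dictionary)
        # a_t / a_f: number of spam / non-spam e-mails seen so far.
        self.n_spam = 0
        self.n_ham = 0
        # a_t^n / a_f^n: number of spam / non-spam e-mails containing word n.
        self.spam_counts = {w: 0 for w in self.dictionary}
        self.ham_counts = {w: 0 for w in self.dictionary}

    def learn(self, email_words, is_spam):
        """Incremental identification: update the counts with one classified e-mail."""
        words = set(email_words)
        if is_spam:
            self.n_spam += 1
            for w in self.dictionary:
                self.spam_counts[w] += w in words
        else:
            self.n_ham += 1
            for w in self.dictionary:
                self.ham_counts[w] += w in words

    def spam_ratio(self, email_words):
        """Ratio P(Spam=true | words) / P(Spam=false | words): > 1 suggests spam."""
        words = set(email_words)
        # Prior ratio, here estimated from the e-mail counts with add-one
        # smoothing (an assumption made for this sketch).
        ratio = (1 + self.n_spam) / (1 + self.n_ham)
        for w in self.dictionary:
            # Laplace rule of succession for P(W_n = present | Spam).
            p_spam = (1 + self.spam_counts[w]) / (2 + self.n_spam)
            p_ham = (1 + self.ham_counts[w]) / (2 + self.n_ham)
            if w in words:
                ratio *= p_spam / p_ham
            else:
                ratio *= (1 - p_spam) / (1 - p_ham)
        return ratio


# Illustrative usage with a toy dictionary.
f = NaiveBayesSpamFilter(["win", "money", "meeting"])
f.learn(["win", "money"], is_spam=True)
f.learn(["meeting"], is_spam=False)
print(f.spam_ratio(["win", "money"]) > 1)  # True: classified as spam
```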
Bayesian filters (often called recursive Bayesian estimation) are generic probabilistic models for time-evolving processes. Numerous models are particular instances of this generic approach, for instance the Kalman filter or the hidden Markov model (HMM).
The model involves a time series of state variables $S^0, \ldots, S^T$ and a time series of observation variables $O^0, \ldots, O^T$. The decomposition is based on a transition model (or system model) $P(S^t \mid S^{t-1})$, which formalizes the transition from the state at time $t-1$ to the state at time $t$; on an observation model $P(O^t \mid S^t)$, which expresses what can be observed at time $t$ when the system is in state $S^t$; and on an initial state $P(S^0 \wedge O^0)$.
The parametric forms are not constrained and different choices lead to different well-known models: see the Kalman filters and hidden Markov models just below.
The typical question for such models is: what is the probability distribution for the state at time $k$ knowing the observations from instant $0$ to $t$, i.e. $P(S^k \mid O^0 \wedge \cdots \wedge O^t)$?
The most common case is Bayesian filtering, where $k = t$: it searches for the present state, knowing the past observations.
However, it is also possible to do prediction ($k > t$), extrapolating a future state from past observations, or to do smoothing ($k < t$), recovering a past state from observations made either before or after that instant.
More complicated questions may also be asked as shown below in the HMM section.
Bayesian filters have a very interesting recursive property, which contributes greatly to their attractiveness: $P(S^t \mid O^0 \wedge \cdots \wedge O^t)$ may be computed simply from $P(S^{t-1} \mid O^0 \wedge \cdots \wedge O^{t-1})$ with the following formula:

$$P(S^t \mid O^0 \wedge \cdots \wedge O^t) \propto P(O^t \mid S^t) \sum_{S^{t-1}} \left[ P(S^t \mid S^{t-1}) \, P(S^{t-1} \mid O^0 \wedge \cdots \wedge O^{t-1}) \right]$$
Another interesting point of view on this equation is to consider that there are two phases: a prediction phase and an estimation phase. During the prediction phase, the state is predicted using the dynamic model and the estimation of the state at the previous time step:

$$P(S^t \mid O^0 \wedge \cdots \wedge O^{t-1}) = \sum_{S^{t-1}} \left[ P(S^t \mid S^{t-1}) \, P(S^{t-1} \mid O^0 \wedge \cdots \wedge O^{t-1}) \right]$$

During the estimation phase, the prediction is either confirmed or invalidated using the last observation:

$$P(S^t \mid O^0 \wedge \cdots \wedge O^t) \propto P(O^t \mid S^t) \, P(S^t \mid O^0 \wedge \cdots \wedge O^{t-1})$$
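A minimal Python sketch of this prediction/estimation recursion over a discrete state space is shown below; the two-state weather example and all numbers are purely illustrative:

```python
def bayes_filter_step(belief, transition, observation_likelihood):
    """One step of recursive Bayesian estimation over a discrete state space.

    belief: dict {state: P(S^{t-1} | O^0 .. O^{t-1})}
    transition: dict {(prev_state, state): P(S^t = state | S^{t-1} = prev_state)}
    observation_likelihood: dict {state: P(O^t = o^t | S^t = state)} for the
    observation actually received at time t.
    """
    # Prediction phase: push the previous belief through the transition model.
    predicted = {
        s: sum(transition[(sp, s)] * belief[sp] for sp in belief)
        for s in belief
    }
    # Estimation phase: weight the prediction by the observation likelihood
    # and renormalize.
    unnormalized = {s: observation_likelihood[s] * predicted[s] for s in belief}
    z = sum(unnormalized.values())
    return {s: p / z for s, p in unnormalized.items()}


# Illustrative two-state example ("rain" / "dry").
belief = {"rain": 0.5, "dry": 0.5}
transition = {("rain", "rain"): 0.7, ("rain", "dry"): 0.3,
              ("dry", "rain"): 0.3, ("dry", "dry"): 0.7}
# Likelihood of the received observation (e.g. "umbrella seen") in each state.
obs_likelihood = {"rain": 0.9, "dry": 0.2}
belief = bayes_filter_step(belief, transition, obs_likelihood)
print(belief)
```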
The very well-known Kalman filters [6] are a special case of Bayesian filters.
They are defined by the following Bayesian program: the states $S^t$ and observations $O^t$ are continuous variables, and both the transition model $P(S^t \mid S^{t-1})$ and the observation model $P(O^t \mid S^t)$ are specified as Gaussian distributions whose means are linear functions of the conditioning variables.
With these hypotheses and by using the recursive formula, it is possible to solve the inference problem analytically to answer the usual question. This leads to an extremely efficient algorithm, which explains the popularity of Kalman filters and the number of their everyday applications.
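As an illustration, here is a minimal one-dimensional predict/update step in Python; the scalar model and the numerical values are assumptions made for the example, not part of the original presentation:

```python
def kalman_step(mean, var, a, q, h, r, z):
    """One predict/update cycle of a scalar Kalman filter.

    State model:       s_t = a * s_{t-1} + noise with variance q
    Observation model: o_t = h * s_t     + noise with variance r
    (mean, var) is the Gaussian belief over the previous state, z the new observation.
    """
    # Prediction phase (transition model).
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Estimation phase (observation model): Kalman gain and update.
    k = var_pred * h / (h * h * var_pred + r)
    mean_new = mean_pred + k * (z - h * mean_pred)
    var_new = (1 - k * h) * var_pred
    return mean_new, var_new


# Illustrative use: track a roughly constant value from noisy measurements.
mean, var = 0.0, 1.0
for z in [1.2, 0.9, 1.1]:
    mean, var = kalman_step(mean, var, a=1.0, q=0.01, h=1.0, r=0.25, z=z)
print(mean, var)
```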
When there are no obvious linear transition and observation models, it is still often possible, using a first-order Taylor expansion, to treat these models as locally linear. This generalization is commonly called the extended Kalman filter.
Hidden Markov models (HMMs) are another very popular specialization of Bayesian filters.
They are defined by the following Bayesian program: the states $S^t$ and observations $O^t$ are discrete variables, and the transition model $P(S^t \mid S^{t-1})$ and the observation model $P(O^t \mid S^t)$ are both specified using probability matrices.
The question most frequently asked of an HMM is: what is the most probable series of states that leads to the present state, knowing the past observations?
This particular question may be answered with a specific and very efficient algorithm called the Viterbi algorithm.
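A minimal Python sketch of the Viterbi algorithm for such a discrete HMM follows; the two-state model and its probability matrices are purely illustrative:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state sequence for a sequence of observations.

    start_p[s]: P(S^0 = s); trans_p[s][s']: P(S^t = s' | S^{t-1} = s);
    emit_p[s][o]: P(O^t = o | S^t = s).
    """
    # best[s] = probability of the best path ending in state s; path[s] = that path.
    best = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    path = {s: [s] for s in states}
    for o in observations[1:]:
        new_best, new_path = {}, {}
        for s in states:
            # Best predecessor for state s at this time step.
            prev = max(states, key=lambda sp: best[sp] * trans_p[sp][s])
            new_best[s] = best[prev] * trans_p[prev][s] * emit_p[s][o]
            new_path[s] = path[prev] + [s]
        best, path = new_best, new_path
    final = max(states, key=lambda s: best[s])
    return path[final], best[final]


# Illustrative two-state HMM.
states = ["rain", "dry"]
start_p = {"rain": 0.5, "dry": 0.5}
trans_p = {"rain": {"rain": 0.7, "dry": 0.3}, "dry": {"rain": 0.3, "dry": 0.7}}
emit_p = {"rain": {"umbrella": 0.9, "none": 0.1}, "dry": {"umbrella": 0.2, "none": 0.8}}
print(viterbi(["umbrella", "umbrella", "none"], states, start_p, trans_p, emit_p))
```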
For learning the parameters of HMMs from observation sequences, the Baum–Welch algorithm has been developed.
Since 2000, Bayesian programming has been used to develop both robotics applications and life sciences models. [7]
In robotics, Bayesian programming was applied to autonomous robotics, [8] [9] [10] [11] [12] robotic CAD systems, [13] advanced driver-assistance systems, [14] robotic arm control, mobile robotics, [15] [16] human-robot interaction, [17] human-vehicle interaction (Bayesian autonomous driver models), [18] [19] [20] [21] [22] video game avatar programming and training [23] and real-time strategy games (AI). [24]
In life sciences, Bayesian programming was used in vision to reconstruct shape from motion, [25] to model visuo-vestibular interaction [26] and to study saccadic eye movements; [27] in speech perception and control to study early speech acquisition [28] and the emergence of articulatory-acoustic systems; [29] and to model handwriting perception and control. [30]
Bayesian program learning has potential applications in voice recognition and synthesis, image recognition and natural language processing. It employs the principles of compositionality (building abstract representations from parts), causality (building complexity from parts) and learning to learn (using previously recognized concepts to ease the creation of new concepts). [31]
The comparison between probabilistic approaches (not only Bayesian programming) and possibility theories continues to be debated.
Possibility theories such as fuzzy sets, [32] fuzzy logic [33] and possibility theory [34] are alternatives to probability for modelling uncertainty. Their proponents argue that probability is insufficient or inconvenient to model certain aspects of incomplete or uncertain knowledge.
The defense of probability is mainly based on Cox's theorem, which starts from four postulates concerning rational reasoning in the presence of uncertainty. It demonstrates that the only mathematical framework that satisfies these postulates is probability theory. The argument is then that any approach other than probability necessarily violates one of these postulates, and the question becomes what is gained by that violation.
The purpose of probabilistic programming is to unify the scope of classical programming languages with probabilistic modeling (especially Bayesian networks) in order to deal with uncertainty while profiting from the expressiveness of programming languages to encode complexity.
Extended classical programming languages include logical languages, as proposed in Probabilistic Horn Abduction, [35] Independent Choice Logic, [36] PRISM [37] and ProbLog, which proposes an extension of Prolog.
There are also extensions of functional programming languages (essentially Lisp and Scheme), such as IBAL or CHURCH. The underlying programming languages can be object-oriented, as in BLOG and FACTORIE, or more standard ones, as in CES and FIGARO. [38]
The purpose of Bayesian programming is different. Jaynes' precept of "probability as logic" argues that probability is an extension of and an alternative to logic above which a complete theory of rationality, computation and programming can be rebuilt. [1] Bayesian programming attempts to replace classical languages with a programming approach based on probability that considers incompleteness and uncertainty.
The precise comparison between the semantics and power of expression of Bayesian and probabilistic programming is an open question.