Bregman divergence


In mathematics, specifically statistics and information geometry, a Bregman divergence or Bregman distance is a measure of difference between two points, defined in terms of a strictly convex function; they form an important class of divergences. When the points are interpreted as probability distributions – notably as either values of the parameter of a parametric model or as a data set of observed values – the resulting distance is a statistical distance. The most basic Bregman divergence is the squared Euclidean distance.


Bregman divergences are similar to metrics, but satisfy neither the triangle inequality (ever) nor symmetry (in general). However, they satisfy a generalization of the Pythagorean theorem, and in information geometry the corresponding statistical manifold is interpreted as a (dually) flat manifold. This allows many techniques of optimization theory to be generalized to Bregman divergences, geometrically as generalizations of least squares.

Bregman divergences are named after Russian mathematician Lev M. Bregman, who introduced the concept in 1967.

Definition

Let $F : \Omega \to \mathbb{R}$ be a continuously-differentiable, strictly convex function defined on a convex set $\Omega$.

The Bregman distance associated with $F$ for points $p, q \in \Omega$ is the difference between the value of $F$ at point $p$ and the value of the first-order Taylor expansion of $F$ around point $q$ evaluated at point $p$:

$$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle.$$
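The definition can be evaluated directly from a generator and its gradient; the following is a minimal numerical sketch (the helper names are illustrative, not part of the article), using the squared-norm generator so that the result is the squared Euclidean distance:

```python
import numpy as np

def bregman_divergence(F, grad_F, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# Generator F(x) = ||x||^2 gives the squared Euclidean distance.
F = lambda x: np.dot(x, x)
grad_F = lambda x: 2.0 * x

p, q = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(bregman_divergence(F, grad_F, p, q))  # 9.25
print(np.sum((p - q) ** 2))                 # 9.25, identical
```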

Properties

Non-negativity: $D_F(p, q) \ge 0$ for all $p, q$, with equality if and only if $p = q$; this follows from the strict convexity of $F$.

Duality: if $F^*$ denotes the convex conjugate of $F$, then $D_F(p, q) = D_{F^*}(q^*, p^*)$. Here, $p^* = \nabla F(p)$ and $q^* = \nabla F(q)$ are the dual points corresponding to $p$ and $q$. Moreover, using the same notations:

$$D_F(p, q) = F(p) + F^*(q^*) - \langle p, q^* \rangle.$$
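The duality identities can be checked numerically; a small sketch, assuming NumPy and using the negative-entropy generator on positive vectors (the conjugate formula $F^*(y) = \sum_i e^{y_i - 1}$ is standard for this generator; helper names are illustrative):

```python
import numpy as np

# F(p) = sum p_i log p_i (negative entropy), with conjugate F*(y) = sum exp(y_i - 1).
F       = lambda p: np.sum(p * np.log(p))
grad_F  = lambda p: np.log(p) + 1.0
Fc      = lambda y: np.sum(np.exp(y - 1.0))
grad_Fc = lambda y: np.exp(y - 1.0)

def breg(F, gF, a, b):
    return F(a) - F(b) - np.dot(gF(b), a - b)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
p_star, q_star = grad_F(p), grad_F(q)

print(breg(F, grad_F, p, q))                  # D_F(p, q)
print(breg(Fc, grad_Fc, q_star, p_star))      # D_{F*}(q*, p*): same value
print(F(p) + Fc(q_star) - np.dot(p, q_star))  # same value again
```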

Bregman projection: for any $W \subseteq \Omega$, define the Bregman projection of $q$ onto $W$,

$$P_W(q) = \operatorname*{arg\,min}_{\omega \in W} D_F(\omega, q).$$

Then, if $W$ is convex, the projection is unique if it exists.

[Figure: Generalized Pythagorean theorem for Bregman divergence.]

Generalized Pythagorean theorem: for any $v \in \Omega$ and $a \in W$,

$$D_F(a, v) \ge D_F\big(a, P_W(v)\big) + D_F\big(P_W(v), v\big).$$

This is an equality if $P_W(v)$ is in the relative interior of $W$.

In particular, this always happens when $W$ is an affine set.
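For the generalized Kullback–Leibler divergence (see Examples below), the Bregman projection of a positive vector onto the affine set $W = \{\omega : \sum_i \omega_i = 1\}$ is simply normalization, so the equality case of the generalized Pythagorean theorem can be verified numerically; a sketch assuming NumPy, with illustrative names:

```python
import numpy as np

def gen_kl(p, q):
    """Generalized KL divergence, the Bregman divergence of negative entropy."""
    return np.sum(p * np.log(p / q) - p + q)

q = np.array([0.3, 0.9, 0.6])       # a positive vector, not normalized
proj = q / q.sum()                   # Bregman projection onto {sum = 1}

a = np.array([0.2, 0.5, 0.3])        # any point of the affine set W
lhs = gen_kl(a, q)
rhs = gen_kl(a, proj) + gen_kl(proj, q)
print(lhs, rhs)                      # equal up to rounding, since W is affine
```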

Proofs

Claim: Bregman balls $B_F(x, r) := \{y \in \Omega : D_F(y, x) \le r\}$ are bounded, and compact if $\Omega$ is closed.

Fix $x \in \Omega$. Subtract an affine function from $F$ (which leaves $D_F$ unchanged), so that $\nabla F(x) = 0$.

Take some $\epsilon > 0$, such that the sphere $\partial B(x, \epsilon) \subset \Omega$. Then consider the "radial-directional" derivative of $F$ on the Euclidean sphere $\partial B(x, \epsilon)$,

$$\langle \nabla F(\theta),\, (\theta - x)/\epsilon \rangle \quad \text{for all } \theta \in \partial B(x, \epsilon).$$

Since $\partial B(x, \epsilon)$ is compact, this derivative achieves its minimal value $\delta$ at some $\theta_0 \in \partial B(x, \epsilon)$.

Since $F$ is strictly convex and $\nabla F(x) = 0$, we have $\delta > 0$. Then, by convexity, every $y$ with $\|y - x\| \ge \epsilon$ satisfies $D_F(y, x) = F(y) - F(x) \ge \delta\,(\|y - x\| - \epsilon)$, so $B_F(x, r) \subset B(x, \epsilon + r/\delta)$ is bounded.

Since $D_F(y, x)$ is continuous in $y$, the sublevel set $B_F(x, r)$ is closed if $\Omega$ is, and hence compact.

Claim: the Bregman projection onto a closed convex set $W \subseteq \Omega$ exists, is unique, and satisfies the generalized Pythagorean theorem.

Fix $v \in \Omega$. Take some $a_0 \in W$, then let $r := D_F(a_0, v)$ and draw the Bregman ball $B_F(v, r) \cap W$. It is closed and bounded, thus compact. Since $D_F(\cdot, v)$ is continuous and strictly convex on it, and bounded below by $0$, it achieves a unique minimum on it; this minimizer is the Bregman projection $P_W(v)$.

By the law of cosines,

$$D_F(a, v) = D_F(a, P_W(v)) + D_F(P_W(v), v) + \langle \nabla F(P_W(v)) - \nabla F(v),\, a - P_W(v) \rangle,$$

and the last term must be $\ge 0$, since $P_W(v)$ minimizes $D_F(\cdot, v)$ in $W$, and $W$ is convex. This proves the inequality.

If the last term were $> 0$ while $P_W(v)$ is in the relative interior of $W$, then we could move from $P_W(v)$ in the direction opposite of $a - P_W(v)$ while remaining in $W$, and thereby decrease $D_F(\cdot, v)$, a contradiction.

Thus $\langle \nabla F(P_W(v)) - \nabla F(v),\, a - P_W(v) \rangle = 0$ in that case, and the inequality is an equality.

Classification theorems

The only symmetric Bregman divergences on a convex set $X \subseteq \mathbb{R}^n$ are the squared generalized Euclidean distances (squared Mahalanobis distances), that is, $D_F(x, y) = (x - y)^\top A (x - y)$ for some symmetric positive definite matrix $A$.

Proof

[Figure: Bregman divergence interpreted as areas.]

For any $x, y \in X$, define $g(t) = F(x + t(y - x))$ for $t \in [0, 1]$. Let $v = y - x$.

Then $g'(t) = \langle \nabla F(x + t v), v \rangle$ for $t \in (0, 1)$, and since $\nabla F$ is continuous, also for $t = 0, 1$. The Bregman divergence between two points of the segment equals the one-dimensional Bregman divergence of $g$ between the corresponding parameters.

Then, from the diagram (which interprets the one-dimensional Bregman divergence as the area between the graph of $g'$ and a horizontal line), we see that for $D_F$ to be symmetric between all pairs of points on the segment, we must have $g'$ linear on $[0, 1]$.

Thus we find that the directional derivative $\langle \nabla F, v \rangle$ varies linearly along any direction $v$. By the next lemma, $F$ is quadratic. Since $F$ is also strictly convex, it is of the form $F(x) = \tfrac{1}{2} x^\top A x + b^\top x + c$, where $A$ is symmetric positive definite, and then $D_F(x, y) = \tfrac{1}{2} (x - y)^\top A (x - y)$.

Lemma: If $S$ is an open subset of $\mathbb{R}^n$, $f : S \to \mathbb{R}$ has continuous derivative, and for every line segment $[x, x + v] \subset S$ the function $t \mapsto \langle \nabla f(x + t v), v \rangle$ is linear in $t$, then $f$ is a quadratic function.

Proof idea: For any quadratic function $q$, the difference $f - q$ still has this derivative-linearity, so we will subtract away a few quadratic functions and show that the remainder becomes zero.

The proof idea can be illustrated fully for the case of $S = \mathbb{R}^2$, so we prove it in this case.

By the derivative-linearity, $f$ is a quadratic function on any line segment in $\mathbb{R}^2$. We subtract away four quadratic functions, such that $f$ becomes identically zero on the x-axis, the y-axis, and the diagonal line $\{x = y\}$.

Let $f_1(x, y) = f(x, y) - (a x^2 + b y^2 + c x y + d x + e y + h)$, for well-chosen coefficients. Use the affine part $d x + e y + h$ to remove the linear and constant terms of $f$ at the origin, and use the $x^2$, $y^2$, $x y$ terms respectively to remove the quadratic terms of $f$ along the three lines.

For any point $p$ not at the origin, there exists a line through $p$ that intersects the x-axis, the y-axis, and the diagonal at three different points. Since $f_1$ is quadratic on this line, and is zero at three different points, $f_1$ is identically zero on this line, thus $f_1(p) = 0$. Thus $f_1 \equiv 0$, and $f$ is quadratic.

The following two characterizations are for divergences on $\Gamma_n$, the set of all probability measures on $\{1, 2, \ldots, n\}$, with $n \ge 2$.

Define a divergence on $\Gamma_n$ as any function of type $D : \Gamma_n \times \Gamma_n \to [0, \infty]$ such that $D(x, x) = 0$ for all $x \in \Gamma_n$. Then:

The only divergence on $\Gamma_n$, for $n \ge 3$, that is both a Bregman divergence and an f-divergence is the Kullback–Leibler divergence. [5]

If $n \ge 3$, then any Bregman divergence on $\Gamma_n$ that satisfies the data processing inequality must be the Kullback–Leibler divergence. [5]

Given a Bregman divergence $D_F$, its "opposite", defined by $D_F^*(v, w) = D_F(w, v)$, is generally not a Bregman divergence. For example, the Kullback–Leibler divergence is both a Bregman divergence and an f-divergence. Its reverse is also an f-divergence, but by the above characterization, the reverse KL divergence cannot be a Bregman divergence.

Examples

The squared Euclidean distance $D_F(x, y) = \|x - y\|^2$ is the most basic Bregman divergence, generated by the strictly convex function $F(x) = \|x\|^2$.

The generalized Kullback–Leibler divergence

$$D_F(p, q) = \sum_i p_i \log \frac{p_i}{q_i} - \sum_i p_i + \sum_i q_i$$

is generated by the negative entropy function $F(p) = \sum_i p_i \log p_i$. When restricted to the simplex, the last two terms cancel, giving the usual Kullback–Leibler divergence for distributions.

The Itakura–Saito distance

$$D_F(p, q) = \sum_i \left( \frac{p_i}{q_i} - \log \frac{p_i}{q_i} - 1 \right)$$

is generated by the convex function $F(p) = -\sum_i \log p_i$.
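Each generator–divergence pair above can be verified directly from the definition; a brief sketch assuming NumPy (names illustrative):

```python
import numpy as np

def breg(F, grad_F, p, q):
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# Negative entropy -> generalized Kullback-Leibler divergence.
d_kl = breg(lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0, p, q)
print(d_kl, np.sum(p * np.log(p / q) - p + q))        # identical

# Burg entropy F(x) = -sum log x_i -> Itakura-Saito distance.
d_is = breg(lambda x: -np.sum(np.log(x)), lambda x: -1.0 / x, p, q)
print(d_is, np.sum(p / q - np.log(p / q) - 1.0))      # identical
```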

Generalizing projective duality

A key tool in computational geometry is the idea of projective duality, which maps points to hyperplanes and vice versa, while preserving incidence and above-below relationships. There are numerous analytical forms of the projective dual: one common form maps the point $p = (p_1, \ldots, p_d)$ to the hyperplane $x_{d+1} = \sum_{i=1}^d 2 p_i x_i - \sum_{i=1}^d p_i^2$, the hyperplane tangent at the lift of $p$ to the paraboloid $x_{d+1} = \sum_{i=1}^d x_i^2$. This mapping can be interpreted (identifying the hyperplane with its normal) as the convex conjugate mapping that takes the point $p$ to its dual point $p^* = \nabla F(p)$, where $F$ defines the $d$-dimensional paraboloid $F(x) = \sum_{i=1}^d x_i^2$.

If we now replace the paraboloid by an arbitrary convex function, we obtain a different dual mapping that retains the incidence and above-below properties of the standard projective dual. This implies that natural dual concepts in computational geometry like Voronoi diagrams and Delaunay triangulations retain their meaning in distance spaces defined by an arbitrary Bregman divergence. Thus, algorithms from "normal" geometry extend directly to these spaces (Boissonnat, Nielsen and Nock, 2010).

Generalization of Bregman divergences

Bregman divergences can be interpreted as limit cases of skewed Jensen divergences (see Nielsen and Boltz, 2011). Jensen divergences can be generalized using comparative convexity, and limit cases of these generalized skewed Jensen divergences yield generalized Bregman divergences (see Nielsen and Nock, 2017). The Bregman chord divergence [6] is obtained by taking a chord instead of a tangent line.
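The limiting behaviour can be illustrated numerically. The sketch below uses the convention $J_{F,\alpha}(p, q) = \alpha F(p) + (1 - \alpha) F(q) - F(\alpha p + (1 - \alpha) q)$ and the scaling $J_{F,\alpha}/\alpha \to D_F(p, q)$ as $\alpha \to 0$; this normalization is one common choice and is an assumption of the sketch, not a formula quoted from the cited papers:

```python
import numpy as np

F = lambda x: np.dot(x, x)          # generator: squared Euclidean norm
grad_F = lambda x: 2.0 * x

def skew_jensen(F, p, q, alpha):
    return alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)

def breg(F, grad_F, p, q):
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

p, q = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for alpha in [0.5, 0.1, 0.01, 0.001]:
    print(alpha, skew_jensen(F, p, q, alpha) / alpha)   # approaches D_F(p, q)
print(breg(F, grad_F, p, q))                             # 9.25
```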

Bregman divergence on other objects

Bregman divergences can also be defined between matrices, between functions, and between measures (distributions). Bregman divergences between matrices include Stein's loss and the von Neumann entropy. Bregman divergences between functions include total squared error, relative entropy, and squared bias; see the references by Frigyik et al. below for definitions and properties. Similarly, Bregman divergences have also been defined over sets, through a submodular set function, which is known as the discrete analog of a convex function. The submodular Bregman divergences subsume a number of discrete distance measures, like the Hamming distance, precision and recall, mutual information, and some other set-based distance measures (see Iyer & Bilmes, 2012, for more details and properties of the submodular Bregman).

For a list of common matrix Bregman divergences, see Table 15.1 in [7].
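As an illustration, the two matrix divergences mentioned above (Stein's loss and the divergence generated by the von Neumann entropy) can be computed directly for symmetric positive definite matrices; a sketch assuming NumPy, with helper names that are ours rather than from the cited table:

```python
import numpy as np

def sym_logm(X):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def von_neumann_div(X, Y):
    """tr(X log X - X log Y - X + Y): Bregman divergence of F(X) = tr(X log X - X)."""
    return np.trace(X @ (sym_logm(X) - sym_logm(Y)) - X + Y)

def stein_loss(X, Y):
    """tr(X Y^-1) - log det(X Y^-1) - n: Bregman divergence of F(X) = -log det X."""
    n = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - n

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); X = A @ A.T + np.eye(3)
B = rng.standard_normal((3, 3)); Y = B @ B.T + np.eye(3)
print(von_neumann_div(X, Y), stein_loss(X, Y))  # nonnegative; zero iff X == Y
```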

Applications

In machine learning, Bregman divergences are used to calculate the bi-tempered logistic loss, which performs better than the standard softmax cross-entropy loss on noisy datasets. [8]

Bregman divergence is used in the formulation of mirror descent, which includes optimization algorithms used in machine learning such as gradient descent and the hedge algorithm.
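As a sketch of how a Bregman divergence enters mirror descent: with the negative-entropy generator on the probability simplex, the proximal step $\arg\min_x \langle \eta g_t, x \rangle + D_F(x, x_t)$ reduces to a multiplicative update followed by normalization (exponentiated gradient). The toy objective and names below are illustrative only:

```python
import numpy as np

def mirror_descent_simplex(grad, x0, eta=0.1, steps=300):
    """Mirror descent with the negative-entropy mirror map (KL Bregman geometry)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        x = x * np.exp(-eta * g)   # proximal step in closed form
        x /= x.sum()               # re-normalize onto the simplex
    return x

# Toy linear objective f(x) = <c, x> on the simplex; the minimizer puts all
# mass on the coordinate with the smallest cost.
c = np.array([0.7, 0.2, 0.5])
print(mirror_descent_simplex(lambda x: c, np.ones(3) / 3, eta=0.5))  # ~[0, 1, 0]
```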

Related Research Articles

<span class="mw-page-title-main">Divergence</span> Vector operator in vector calculus

In vector calculus, divergence is a vector operator that operates on a vector field, producing a scalar field giving the quantity of the vector field's source at each point. More technically, the divergence represents the volume density of the outward flux of a vector field from an infinitesimal volume around a given point.

<span class="mw-page-title-main">Laplace's equation</span> Second-order partial differential equation

In mathematics and physics, Laplace's equation is a second-order partial differential equation named after Pierre-Simon Laplace, who first studied its properties. This is often written as

<span class="mw-page-title-main">Navier–Stokes equations</span> Equations describing the motion of viscous fluid substances

The Navier–Stokes equations are partial differential equations which describe the motion of viscous fluid substances. They were named after French engineer and physicist Claude-Louis Navier and the Irish physicist and mathematician George Gabriel Stokes. They were developed over several decades of progressively building the theories, from 1822 (Navier) to 1842–1850 (Stokes).

<span class="mw-page-title-main">Potential flow</span> Velocity field as the gradient of a scalar function

In fluid dynamics, potential flow is the ideal flow pattern of an inviscid fluid. Potential flows are described and determined by mathematical methods.

In mathematics, the Laplace operator or Laplacian is a differential operator given by the divergence of the gradient of a scalar function on Euclidean space. It is usually denoted by the symbols , (where is the nabla operator), or . In a Cartesian coordinate system, the Laplacian is given by the sum of second partial derivatives of the function with respect to each independent variable. In other coordinate systems, such as cylindrical and spherical coordinates, the Laplacian also has a useful form. Informally, the Laplacian Δf (p) of a function f at a point p measures by how much the average value of f over small spheres or balls centered at p deviates from f (p).

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. The terms "distribution" and "family" are often used loosely: specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter; however, a parametric family of distributions is often referred to as "a distribution", and the set of all exponential families is sometimes loosely referred to as "the" exponential family. They are distinct because they possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

In mathematics and statistics, the quasi-arithmetic mean or generalised f-mean or Kolmogorov-Nagumo-de Finetti mean is one generalisation of the more familiar means such as the arithmetic mean and the geometric mean, using a function . It is also called Kolmogorov mean after Soviet mathematician Andrey Kolmogorov. It is a broader generalization than the regular generalized mean.

This is a list of some vector calculus formulae for working with common curvilinear coordinate systems.

In mathematics, the Helmholtz equation is the eigenvalue problem for the Laplace operator. It corresponds to the linear partial differential equation

In information theory, the Rényi entropy is a quantity that generalizes various notions of entropy, including Hartley entropy, Shannon entropy, collision entropy, and min-entropy. The Rényi entropy is named after Alfréd Rényi, who looked for the most general way to quantify information while preserving additivity for independent events. In the context of fractal dimension estimation, the Rényi entropy forms the basis of the concept of generalized dimensions.

In statistics, the delta method is a result concerning the approximate probability distribution for a function of an asymptotically normal statistical estimator from knowledge of the limiting variance of that estimator.

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

In mathematics, vector spherical harmonics (VSH) are an extension of the scalar spherical harmonics for use with vector fields. The components of the VSH are complex-valued functions expressed in the spherical coordinate basis vectors.

The derivatives of scalars, vectors, and second-order tensors with respect to second-order tensors are of considerable use in continuum mechanics. These derivatives are used in the theories of nonlinear elasticity and plasticity, particularly in the design of algorithms for numerical simulations.

In fluid dynamics, the Oseen equations describe the flow of a viscous and incompressible fluid at small Reynolds numbers, as formulated by Carl Wilhelm Oseen in 1910. Oseen flow is an improved description of these flows, as compared to Stokes flow, with the (partial) inclusion of convective acceleration.

In information geometry, a divergence is a kind of statistical distance: a binary function which establishes the separation from one probability distribution to another on a statistical manifold.

Curvilinear coordinates can be formulated in tensor calculus, with important applications in physics and engineering, particularly for describing transportation of physical quantities and deformation of matter in fluid mechanics and continuum mechanics.

A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, but has since been used in diverse settings in statistics, machine learning and computer science.

In mathematics, mirror descent is an iterative optimization algorithm for finding a local minimum of a differentiable function.

<span class="mw-page-title-main">Diffusion model</span> Deep learning algorithm

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of generative models. The goal of diffusion models is to learn a diffusion process that generates the probability distribution of a given dataset. It mainly consists of three major components: the forward process, the reverse process, and the sampling procedure. Three examples of generic diffusion modeling frameworks used in computer vision are denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.

References

  1. 1 2 "Learning with Bregman Divergences" (PDF). utexas.edu. Retrieved 19 August 2023.
  2. Adamčík, Martin (2014). "The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning". Entropy. 16 (12): 6338–6381. Bibcode:2014Entrp..16.6338A. doi: 10.3390/e16126338 .
  3. Nielsen, Frank (28 October 2021). "Fast Approximations of the Jeffreys Divergence between Univariate Gaussian Mixtures via Mixture Conversions to Exponential-Polynomial Distributions". Entropy. 23 (11): 1417. arXiv: 2107.05901 . Bibcode:2021Entrp..23.1417N. doi: 10.3390/e23111417 . ISSN   1099-4300. PMC   8619509 . PMID   34828115.
  4. Nielsen, Frank; Boissonnat, Jean-Daniel; Nock, Richard (September 2010). "Bregman Voronoi Diagrams: Properties, Algorithms and Applications". Discrete & Computational Geometry . 44 (2): 281–307. arXiv: 0709.2196 . doi:10.1007/s00454-010-9256-1. ISSN   0179-5376. S2CID   1327029.
  5. 1 2 Jiao, Jiantao; Courtade, Thomas; No, Albert; Venkat, Kartik; Weissman, Tsachy (December 2014). "Information Measures: the Curious Case of the Binary Alphabet". IEEE Transactions on Information Theory. 60 (12): 7616–7626. arXiv: 1404.6810 . doi:10.1109/TIT.2014.2360184. ISSN   0018-9448. S2CID   13108908.
  6. Nielsen, Frank; Nock, Richard (2019). "The Bregman Chord Divergence". Geometric Science of Information. Lecture Notes in Computer Science. Vol. 11712. pp. 299–308. arXiv: 1810.09113 . doi:10.1007/978-3-030-26980-7_31. ISBN   978-3-030-26979-1. S2CID   53046425.
  7. "Matrix Information Geometry", R. Nock, B. Magdalou, E. Briys and F. Nielsen, pdf, from this book
  8. Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren (2019). "Robust Bi-Tempered Logistic Loss Based on Bregman Divergences". Conference on Neural Information Processing Systems. pp. 14987-14996. pdf