Geometric median

Figure: Example of the geometric median (in yellow) of a series of points, with the center of mass shown in blue.

In geometry, the geometric median of a discrete set of sample points in a Euclidean space is the point minimizing the sum of distances to the sample points. This generalizes the median, which has the property of minimizing the sum of distances for one-dimensional data, and provides a central tendency in higher dimensions. It is also known as the spatial median, [1] Euclidean minisum point, [1] Torricelli point, [2] or 1-median.


The geometric median is an important estimator of location in statistics, [3] because it minimizes the sum of the L2 distances of the samples. [4] It is to be compared to the mean, which minimizes the sum of the squared L2 distances, and to the coordinate-wise median which minimizes the sum of the L1 distances. It is also a standard problem in facility location, where it models the problem of locating a facility to minimize the cost of transportation. [5] The more general k-median problem asks for the location of k cluster centers minimizing the sum of L2 distances from each sample point to its nearest center.

The special case of the problem for three points in the plane (that is, m = 3 and n = 2 in the definition below) is sometimes also known as Fermat's problem; it arises in the construction of minimal Steiner trees, and was originally posed as a problem by Pierre de Fermat and solved by Evangelista Torricelli. [6] Its solution is now known as the Fermat point of the triangle formed by the three sample points. [7] The geometric median may in turn be generalized to the problem of minimizing the sum of weighted distances, known as the Weber problem after Alfred Weber's discussion of the problem in his 1909 book on facility location. [1] Some sources instead call Weber's problem the Fermat–Weber problem, [8] but others use this name for the unweighted geometric median problem. [9]

Wesolowsky (1993) provides a survey of the geometric median problem. See Fekete, Mitchell & Beurer (2005) for generalizations of the problem to non-discrete point sets.

Definition

Formally, for a given set of m points x_1, x_2, ..., x_m, with each x_i in R^n, the geometric median is defined as the minimizer of the sum of L2 distances,

\underset{y \in \mathbb{R}^n}{\operatorname{arg\,min}} \; \sum_{i=1}^{m} \left\| x_i - y \right\|_2 .

Here, arg min means the value of the argument y which minimizes the sum. In this case, it is the point in n-dimensional Euclidean space from which the sum of all Euclidean distances to the x_i is minimum.
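For intuition, the following minimal sketch (with made-up sample points, and using a general-purpose numerical minimizer purely as a black box for the objective above) contrasts the geometric median with the centroid and the coordinate-wise median on a small planar data set:

```python
# Illustrative comparison of three notions of "center" for a small 2-D sample.
# The data points are arbitrary, chosen so that the single outlier makes the
# differences visible; scipy's generic minimizer stands in for the specialized
# algorithms discussed later in the article.
import numpy as np
from scipy.optimize import minimize

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])

centroid = points.mean(axis=0)            # minimizes the sum of squared L2 distances
coord_median = np.median(points, axis=0)  # minimizes the sum of L1 distances

def sum_of_distances(y):
    # The objective from the definition: sum of Euclidean distances to the samples.
    return np.linalg.norm(points - y, axis=1).sum()

geometric_median = minimize(sum_of_distances, x0=centroid).x

print("centroid:         ", centroid)
print("coordinate median:", coord_median)
print("geometric median: ", geometric_median)
# The outlier at (10, 10) drags the centroid much further than the geometric median.
```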

Properties

Special cases

Computation

Although the geometric median is an easy concept to understand, computing it poses a challenge. The centroid or center of mass, defined similarly to the geometric median but minimizing the sum of the squares of the distances to each point, can be found by a simple formula: its coordinates are the averages of the coordinates of the points. However, it has been shown that no explicit formula, nor an exact algorithm involving only arithmetic operations and kth roots, can exist in general for the geometric median. Therefore, only numerical or symbolic approximations to the solution of this problem are possible under this model of computation. [15]

However, it is straightforward to calculate an approximation to the geometric median using an iterative procedure in which each step produces a more accurate approximation. Procedures of this type can be derived from the fact that the sum of distances to the sample points is a convex function, since the distance to each sample point is convex and the sum of convex functions remains convex. Therefore, procedures that decrease the sum of distances at each step cannot get trapped in a local optimum.

One common approach of this type, called Weiszfeld's algorithm after the work of Endre Weiszfeld, [16] is a form of iteratively re-weighted least squares. This algorithm defines a set of weights that are inversely proportional to the distances from the current estimate to the sample points, and creates a new estimate that is the weighted average of the samples according to these weights. That is,

y_{k+1} = \left( \sum_{i=1}^{m} \frac{x_i}{\left\| x_i - y_k \right\|} \right) \Bigg/ \left( \sum_{i=1}^{m} \frac{1}{\left\| x_i - y_k \right\|} \right) .

This method converges for almost all initial positions, but may fail to converge when one of its estimates falls on one of the given points. It can be modified to handle these cases so that it converges for all initial points. [12]
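A minimal sketch of this iteration in Python (the starting point, tolerance, and iteration cap are arbitrary illustrative choices, and the degenerate case of an estimate landing exactly on a sample point is only detected, not resolved) might look as follows:

```python
# Illustrative sketch of Weiszfeld-style iteratively re-weighted averaging.
# Starting point, tolerance, and iteration limit are arbitrary choices; the
# basic update does not handle an estimate coinciding with a sample point,
# which is the degenerate case discussed above.
import numpy as np

def weiszfeld(points, tol=1e-10, max_iter=1000):
    y = points.mean(axis=0)  # centroid as a convenient starting estimate
    for _ in range(max_iter):
        distances = np.linalg.norm(points - y, axis=1)
        if np.any(distances == 0):
            # Estimate sits exactly on a sample point; a modified rule is needed.
            return y
        weights = 1.0 / distances
        y_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
print(weiszfeld(points))
```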

Bose, Maheshwari & Morin (2003) describe more sophisticated geometric optimization procedures for finding approximately optimal solutions to this problem. Cohen et al. (2016) show how to compute the geometric median to arbitrary precision in nearly linear time. Note also that the problem can be formulated as the second-order cone program

\underset{y \in \mathbb{R}^n,\; s \in \mathbb{R}^m}{\text{minimize}} \; \sum_{i=1}^{m} s_i \quad \text{subject to} \quad \left\| x_i - y \right\|_2 \leq s_i \quad \text{for } i = 1, \ldots, m,

which can be solved in polynomial time using common optimization solvers.
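As a sketch, the cone program above can be written almost verbatim with a modelling layer such as cvxpy (assumed available here); the sample points are the same illustrative data as in the earlier snippets:

```python
# Sketch of the second-order cone formulation using the cvxpy modelling library
# (an assumption of this example; any conic solver could be substituted).
# Each auxiliary variable s[i] upper-bounds the distance from y to the i-th
# point, so minimizing their sum reproduces the sum-of-distances objective.
import cvxpy as cp
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
m, n = points.shape

y = cp.Variable(n)
s = cp.Variable(m)
constraints = [cp.norm(points[i] - y, 2) <= s[i] for i in range(m)]
problem = cp.Problem(cp.Minimize(cp.sum(s)), constraints)
problem.solve()
print(y.value)
```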

Characterization of the geometric median

If y is distinct from all the given points x_i, then y is the geometric median if and only if it satisfies

0 = \sum_{i=1}^{m} \frac{x_i - y}{\left\| x_i - y \right\|} .

This is equivalent to

y = \left( \sum_{i=1}^{m} \frac{x_i}{\left\| x_i - y \right\|} \right) \Bigg/ \left( \sum_{i=1}^{m} \frac{1}{\left\| x_i - y \right\|} \right),

which is closely related to Weiszfeld's algorithm.

In general, y is the geometric median if and only if there are vectors u_i such that

0 = \sum_{i=1}^{m} u_i ,

where for x_i ≠ y,

u_i = \frac{x_i - y}{\left\| x_i - y \right\|}

and for x_i = y,

\left\| u_i \right\| \leq 1 .

An equivalent formulation of this condition is

\left\| \sum_{i \,:\, x_i \neq y} \frac{x_i - y}{\left\| x_i - y \right\|} \right\| \leq \left| \{\, i : x_i = y \,\} \right| .

This condition can be seen as a generalization of the median property, in the sense that any partition of the points, in particular as induced by any hyperplane through y, has equal and opposite sums of unit directions from y on each side. In the one-dimensional case, the hyperplane is the point y itself, and the sum of directions simplifies to the (directed) counting measure.
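As a quick numerical sanity check of this characterization (again with illustrative data and a general-purpose minimizer, and assuming the optimum does not coincide with a sample point), the unit vectors from a computed geometric median toward the samples should sum to approximately zero:

```python
# Numerical check of the first-order condition: at the geometric median y
# (away from the sample points) the unit vectors toward the samples sum to
# approximately zero. The data and the generic minimizer are illustrative only.
import numpy as np
from scipy.optimize import minimize

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
objective = lambda y: np.linalg.norm(points - y, axis=1).sum()
y = minimize(objective, x0=points.mean(axis=0)).x

directions = (points - y) / np.linalg.norm(points - y, axis=1)[:, None]
print(directions.sum(axis=0))  # close to the zero vector at the optimum
```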

Generalizations

The geometric median can be generalized from Euclidean spaces to general Riemannian manifolds (and even metric spaces) using the same idea which is used to define the Fréchet mean on a Riemannian manifold. [17] [18] Let M be a Riemannian manifold with corresponding distance function d(·, ·), let w_1, ..., w_m be weights summing to 1, and let x_1, ..., x_m be observations from M. Then we define the weighted geometric median \hat{x} (or weighted Fréchet median) of the data points as

\hat{x} = \underset{x \in M}{\operatorname{arg\,min}} \; \sum_{i=1}^{m} w_i \, d(x, x_i) .

If all the weights are equal, we say simply that \hat{x} is the geometric median.
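For a concrete, if simplistic, illustration, the sketch below computes a weighted Fréchet median on the unit sphere with its great-circle (geodesic) distance; the observations, weights, parametrization by spherical coordinates, and use of a generic optimizer are all assumptions of this example rather than part of the cited constructions:

```python
# Sketch: weighted geometric (Fréchet) median on the unit sphere S^2, with
# d(x, y) = arccos(<x, y>) as the geodesic distance. The observations, weights,
# and minimization over spherical coordinates are illustrative choices only.
import numpy as np
from scipy.optimize import minimize

def sph_to_vec(theta_phi):
    theta, phi = theta_phi
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

observations = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
weights = np.array([0.4, 0.3, 0.3])  # weights summing to 1

def weighted_geodesic_cost(theta_phi):
    x = sph_to_vec(theta_phi)
    dots = np.clip(observations @ x, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return np.sum(weights * np.arccos(dots))

result = minimize(weighted_geodesic_cost, x0=[np.pi / 4, np.pi / 4])
print(sph_to_vec(result.x))  # point on the sphere minimizing the weighted sum
```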

See also

Notes

  1. Drezner et al. (2002).
  2. Cieslik (2006).
  3. Lawera & Thompson (1993).
  4. Dodge & Rousson (1999).
  5. Eiselt & Marianov (2011).
  6. Krarup & Vajda (1997).
  7. Spain (1996).
  8. Brimberg (1995).
  9. Bose, Maheshwari & Morin (2003).
  10. Haldane (1948).
  11. Claim 18.10, Geometric Methods and Optimization Problems, V. Boltyanski, H. Martini, V. Soltan, Springer, 1999.
  12. Vardi & Zhang (2000).
  13. Lopuhaä & Rousseeuw (1991).
  14. Cieslik (2006), p. 6; Plastria (2006). The convex case was originally proven by Giovanni Fagnano.
  15. Bajaj (1986); Bajaj (1988). Earlier, Cockayne & Melzak (1969) proved that the Steiner point for 5 points in the plane cannot be constructed with ruler and compass.
  16. Weiszfeld (1937); Kuhn (1973); Chandrasekaran & Tamir (1989).
  17. Fletcher, P. Thomas; Venkatasubramanian, Suresh; Joshi, Sarang (23 June 2008). "Robust statistics on Riemannian manifolds via the geometric median". 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA: IEEE.
  18. Fletcher, Venkatasubramanian & Joshi (2009).


References