Geometric median

Last updated
Example of geometric median (in yellow) of a series of points. In blue the Center of mass. Geometric median example.svg
Example of geometric median (in yellow) of a series of points. In blue the Center of mass.

In geometry, the geometric median of a discrete set of sample points in a Euclidean space is the point minimizing the sum of distances to the sample points. This generalizes the median, which has the property of minimizing the sum of distances for one-dimensional data, and provides a central tendency in higher dimensions. It is also known as the spatial median, [1] Euclidean minisum point, [1] Torricelli point, [2] or 1-median.

Contents

The geometric median is an important estimator of location in statistics, [3] because it minimizes the sum of the L2 distances of the samples. [4] It is to be compared to the mean, which minimizes the sum of the squared L2 distances, and to the coordinate-wise median which minimizes the sum of the L1 distances. It is also a standard problem in facility location, where it models the problem of locating a facility to minimize the cost of transportation. [5] The more general k-median problem asks for the location of k cluster centers minimizing the sum of L2 distances from each sample point to its nearest center.

The special case of the problem for three points in the plane (that is, m = 3 and n = 2 in the definition below) is sometimes also known as Fermat's problem; it arises in the construction of minimal Steiner trees, and was originally posed as a problem by Pierre de Fermat and solved by Evangelista Torricelli. [6] Its solution is now known as the Fermat point of the triangle formed by the three sample points. [7] The geometric median may in turn be generalized to the problem of minimizing the sum of weighted distances, known as the Weber problem after Alfred Weber's discussion of the problem in his 1909 book on facility location. [1] Some sources instead call Weber's problem the Fermat–Weber problem, [8] but others use this name for the unweighted geometric median problem. [9]

Wesolowsky (1993) provides a survey of the geometric median problem. See Fekete, Mitchell & Beurer (2005) for generalizations of the problem to non-discrete point sets.

Definition

Formally, for a given set of m points with each , the geometric median is defined as the sum of the L2 distances minimizer

Here, arg min means the value of the argument which minimizes the sum. In this case, it is the point in n-dimensional Euclidean space from where the sum of all Euclidean distances to the 's is minimum.

Properties

Special cases

Computation

Despite the geometric median's being an easy-to-understand concept, computing it poses a challenge. The centroid or center of mass, defined similarly to the geometric median as minimizing the sum of the squares of the distances to each point, can be found by a simple formula — its coordinates are the averages of the coordinates of the points — but it has been shown that no explicit formula, nor an exact algorithm involving only arithmetic operations and kth roots, can exist in general for the geometric median. Therefore, only numerical or symbolic approximations to the solution of this problem are possible under this model of computation. [15]

However, it is straightforward to calculate an approximation to the geometric median using an iterative procedure in which each step produces a more accurate approximation. Procedures of this type can be derived from the fact that the sum of distances to the sample points is a convex function, since the distance to each sample point is convex and the sum of convex functions remains convex. Therefore, procedures that decrease the sum of distances at each step cannot get trapped in a local optimum.

One common approach of this type, called Weiszfeld's algorithm after the work of Endre Weiszfeld, [16] is a form of iteratively re-weighted least squares. This algorithm defines a set of weights that are inversely proportional to the distances from the current estimate to the sample points, and creates a new estimate that is the weighted average of the sample according to these weights. That is,

This method converges for almost all initial positions, but may fail to converge when one of its estimates falls on one of the given points. It can be modified to handle these cases so that it converges for all initial points. [12]

Bose, Maheshwari & Morin (2003) describe more sophisticated geometric optimization procedures for finding approximately optimal solutions to this problem. Cohen et al. (2016) show how to compute the geometric median to arbitrary precision in nearly linear time. Note also that the problem can be formulated as the second-order cone program

which can be solved in polynomial time using common optimization solvers.

Characterization of the geometric median

If y is distinct from all the given points, xi, then y is the geometric median if and only if it satisfies:

This is equivalent to:

which is closely related to Weiszfeld's algorithm.

In general, y is the geometric median if and only if there are vectors ui such that:

where for xiy,

and for xi = y,

An equivalent formulation of this condition is

It can be seen as a generalization of the median property, in the sense that any partition of the points, in particular as induced by any hyperplane through y, has the same and opposite sum of positive directions from y on each side. In the one dimensional case, the hyperplane is the point y itself, and the sum of directions simplifies to the (directed) counting measure.

Generalizations

The geometric median can be generalized from Euclidean spaces to general Riemannian manifolds (and even metric spaces) using the same idea which is used to define the Fréchet mean on a Riemannian manifold. [17] [18] Let be a Riemannian manifold with corresponding distance function , let be weights summing to 1, and let be observations from . Then we define the weighted geometric median (or weighted Fréchet median) of the data points as

.

If all the weights are equal, we say simply that is the geometric median.

See also

Notes

  1. 1 2 3 Drezner et al. (2002)
  2. Cieslik (2006).
  3. Lawera & Thompson (1993).
  4. Dodge & Rousson (1999).
  5. Eiselt & Marianov (2011).
  6. Krarup & Vajda (1997).
  7. Spain (1996).
  8. Brimberg (1995).
  9. Bose, Maheshwari & Morin (2003).
  10. 1 2 3 Haldane (1948)
  11. Claim 18.10, Geometric Methods and Optimization Problems, V. Boltyanski, H. Martini, V. Soltan, Springer, 1999.
  12. 1 2 Vardi & Zhang (2000)
  13. 1 2 3 Lopuhaä & Rousseeuw (1991)
  14. Cieslik (2006), p. 6; Plastria (2006). The convex case was originally proven by Giovanni Fagnano.
  15. Bajaj (1986); Bajaj (1988). Earlier, Cockayne & Melzak (1969) proved that the Steiner point for 5 points in the plane cannot be constructed with ruler and compass
  16. Weiszfeld (1937); Kuhn (1973); Chandrasekaran & Tamir (1989).
  17. Fletcher, P. Thomas; Venkatasubramanian, Suresh; Joshi, Sarang (23 June 2008). "Robust statistics on Riemannian manifolds via the geometric median". 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA: IEEE.
  18. Fletcher, Venkatasubramanian & Joshi (2009).

Related Research Articles

In statistics, a central tendency is a central or typical value for a probability distribution.

<span class="mw-page-title-main">Euclidean space</span> Fundamental space of geometry

Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, in Euclid's Elements, it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are Euclidean spaces of any positive integer dimension n, which are called Euclidean n-spaces when one wants to specify their dimension. For n equal to one or two, they are commonly called respectively Euclidean lines and Euclidean planes. The qualifier "Euclidean" is used to distinguish Euclidean spaces from other spaces that were later considered in physics and modern mathematics.

<span class="mw-page-title-main">Median</span> Middle quantile of a data set or probability distribution

The median of a set of numbers is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as the “middle" value. The basic feature of the median in describing data compared to the mean is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of the center. Median income, for example, may be a better way to describe the center of the income distribution because increases in the largest incomes alone have no effect on the median. For this reason, the median is of central importance in robust statistics.

A mean is a numeric quantity representing the center of a collection of numbers and is intermediate to the extreme values of a set of numbers. There are several kinds of means in mathematics, especially in statistics. Each attempts to summarize or typify a given group of data, illustrating the magnitude and sign of the data set. Which of these measures is most illuminating depends on what is being measured, and on context and purpose.

<span class="mw-page-title-main">Metric space</span> Mathematical space with a notion of distance

In mathematics, a metric space is a set together with a notion of distance between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general setting for studying many of the concepts of mathematical analysis and geometry.

<span class="mw-page-title-main">Geodesic</span> Straight path on a curved surface or a Riemannian manifold

In geometry, a geodesic is a curve representing in some sense the shortest path (arc) between two points in a surface, or more generally in a Riemannian manifold. The term also has meaning in any differentiable manifold with a connection. It is a generalization of the notion of a "straight line".

In probability theory, Chebyshev's inequality provides an upper bound on the probability of deviation of a random variable from its mean. More specifically, the probability that a random variable deviates from its mean by more than is at most , where is any positive constant and is the standard deviation.

<span class="mw-page-title-main">Centroid</span> Mean position of all the points in a shape

In mathematics and physics, the centroid, also known as geometric center or center of figure, of a plane figure or solid figure is the arithmetic mean position of all the points in the surface of the figure. The same definition extends to any object in -dimensional Euclidean space.

In differential geometry, the Ricci curvature tensor, named after Gregorio Ricci-Curbastro, is a geometric object which is determined by a choice of Riemannian or pseudo-Riemannian metric on a manifold. It can be considered, broadly, as a measure of the degree to which the geometry of a given metric tensor differs locally from that of ordinary Euclidean space or pseudo-Euclidean space.

This is a glossary of some terms used in Riemannian geometry and metric geometry — it doesn't cover the terminology of differential topology.

The Mahalanobis distance is a measure of the distance between a point and a distribution , introduced by P. C. Mahalanobis in 1936. The mathematical details of Mahalanobis distance has appeared in the Journal of The Asiatic Society of Bengal. Mahalanobis's definition was prompted by the problem of identifying the similarities of skulls based on measurements. The sampling distribution of Mahalanobis distance has been obtained by Professor R.C. Bose, under the assumption of equal dispersion.

<span class="mw-page-title-main">Radon's theorem</span> Says d+2 points in d dimensions can be partitioned into two subsets whose convex hulls intersect

In geometry, Radon's theorem on convex sets, published by Johann Radon in 1921, states that:

Any set of d + 2 points in Rd can be partitioned into two sets whose convex hulls intersect.

In mathematics, Hilbert's fourth problem in the 1900 list of Hilbert's problems is a foundational question in geometry. In one statement derived from the original, it was to find — up to an isomorphism — all geometries that have an axiomatic system of the classical geometry, with those axioms of congruence that involve the concept of the angle dropped, and `triangle inequality', regarded as an axiom, added.

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances, but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.

<span class="mw-page-title-main">Varignon frame</span>

The Varignon frame, named after Pierre Varignon, is a mechanical device which can be used to determine an optimal location of a warehouse for the destribution of goods to a set of shops. Optimal means that the sum of the weighted distances of the shops to the warehouse should be minimal. The frame consists of a board with n holes corresponding to the n shops at the locations , n strings are tied together in a knot at one end, the loose ends are passed, one each, through the holes and are attached to weights below the board. If the influence of friction and other odds of the real world are neglected, the knot will take a position of equilibrium . It can be shown, that point is the optimal location which minimizes the weighted sum of distances

<span class="mw-page-title-main">Systolic geometry</span> Form of differential geometry

In mathematics, systolic geometry is the study of systolic invariants of manifolds and polyhedra, as initially conceived by Charles Loewner and developed by Mikhail Gromov, Michael Freedman, Peter Sarnak, Mikhail Katz, Larry Guth, and others, in its arithmetical, ergodic, and topological manifestations. See also Introduction to systolic geometry.

The study of facility location problems (FLP), also known as location analysis, is a branch of operations research and computational geometry concerned with the optimal placement of facilities to minimize transportation costs while considering factors like avoiding placing hazardous materials near housing, and competitors' facilities. The techniques also apply to cluster analysis.

<span class="mw-page-title-main">Pu's inequality</span>

In differential geometry, Pu's inequality, proved by Pao Ming Pu, relates the area of an arbitrary Riemannian surface homeomorphic to the real projective plane with the lengths of the closed curves contained in it.

In mathematics and statistics, the Fréchet mean is a generalization of centroids to metric spaces, giving a single representative point or central tendency for a cluster of points. It is named after Maurice Fréchet. Karcher mean is the renaming of the Riemannian Center of Mass construction developed by Karsten Grove and Hermann Karcher. On the real numbers, the arithmetic mean, median, geometric mean, and harmonic mean can all be interpreted as Fréchet means for different distance functions.

In geometry, the Weber problem, named after Alfred Weber, is one of the most famous problems in location theory. It requires finding a point in the plane that minimizes the sum of the transportation costs from this point to n destination points, where different destination points are associated with different costs per unit distance.

References