Weighted median

Last updated October 15, 2024

In statistics, a weighted median of a sample is the 50% weighted percentile.^[1]^[2]^[3] It was first proposed by F. Y. Edgeworth in 1888.^[4]^[5] Like the median, it is useful as an estimator of central tendency that is robust against outliers. It allows for non-uniform statistical weights, which can reflect, for example, the varying precision of the measurements in the sample.

Definition

General case

For $n$ distinct ordered elements $x_{1},x_{2},\ldots ,x_{n}$ with positive weights $w_{1},w_{2},\ldots ,w_{n}$ such that $\sum _{i=1}^{n}w_{i}=1$, the weighted median is the element $x_{k}$ satisfying

$$\sum _{i=1}^{k-1}w_{i}\leq 1/2$$

and

$$\sum _{i=k+1}^{n}w_{i}\leq 1/2$$
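The two inequalities can be checked directly. The following is a minimal sketch, assuming the weights are already ordered by element value and sum to 1; the function name `satisfies_general_condition` and the sample weights are hypothetical and chosen only for illustration. Exact fractions are used so that comparisons against 1/2 are not affected by floating-point rounding.

```python
from fractions import Fraction

def satisfies_general_condition(weights, k):
    """Return True if the element at 0-based index k meets both inequalities.

    Assumes `weights` is ordered by element value and sums to 1.
    """
    half = Fraction(1, 2)
    below = sum(weights[:k], Fraction(0))      # total weight of elements before x_k
    above = sum(weights[k + 1:], Fraction(0))  # total weight of elements after x_k
    return below <= half and above <= half

# Hypothetical weights for four ordered elements.
weights = [Fraction(1, 10), Fraction(2, 10), Fraction(4, 10), Fraction(3, 10)]
candidates = [k for k in range(len(weights))
              if satisfies_general_condition(weights, k)]
```

Here only the third element (index 2) qualifies, since the weights on either side of it each total 3/10.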

Special case

Consider a set of elements in which two of the elements satisfy the general case. This occurs when both elements' respective weights border the midpoint of the set of weights without encapsulating it; rather, each element defines a partition equal to $1/2$. These elements are referred to as the lower weighted median and upper weighted median. Their conditions are satisfied as follows:

Lower Weighted Median

$$\sum _{i=1}^{k-1}w_{i}<1/2$$

and

$$\sum _{i=k+1}^{n}w_{i}=1/2$$

Upper Weighted Median

$$\sum _{i=1}^{k-1}w_{i}=1/2$$

and

$$\sum _{i=k+1}^{n}w_{i}<1/2$$

Ideally, a new element would be created using the mean of the upper and lower weighted medians and assigned a weight of zero. This method is similar to finding the median of a set with an even number of elements. The new element would be a true median since the sums of the weights on either side of this partition point would be equal.
Depending on the application, it may not be possible or wise to create new data. In this case, the weighted median should be chosen based on which element keeps the partitions most equal. This will always be the weighted median with the lowest weight.
In the event that the upper and lower weighted medians have equal weights, the lower weighted median is generally accepted, as originally proposed by Edgeworth.^[6]
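The choice between the two candidates can be sketched in code. This is an illustrative sketch only, assuming distinct values sorted in ascending order and exact weights that sum to 1; the names `lower_upper_weighted_median`, `resolve`, and `allow_new_value` are hypothetical. Exact fractions are assumed because the special case depends on a partition summing to exactly 1/2, which floating-point sums may miss.

```python
from fractions import Fraction

def lower_upper_weighted_median(values, weights):
    """Return (lower, upper) weighted medians of value-sorted data.

    Both entries are the same element unless a split at exactly 1/2 exists.
    """
    half = Fraction(1, 2)
    candidates = []
    below = Fraction(0)
    for x, w in zip(values, weights):
        above = 1 - below - w          # weight strictly above x
        if below <= half and above <= half:
            candidates.append(x)
        below += w                     # weight strictly below the next element
    return candidates[0], candidates[-1]

def resolve(values, weights, allow_new_value=True):
    """Pick one weighted median following the rules described above."""
    lower, upper = lower_upper_weighted_median(values, weights)
    if lower == upper:
        return lower
    if allow_new_value:
        return (lower + upper) / 2     # mean of the two candidates
    weight_of = dict(zip(values, weights))
    # Prefer the candidate with the smaller weight; on a tie keep the lower one.
    return upper if weight_of[upper] < weight_of[lower] else lower

uniform = [Fraction(1, 4)] * 4
skewed = [Fraction(49, 100), Fraction(1, 100), Fraction(25, 100), Fraction(25, 100)]
pair_uniform = lower_upper_weighted_median([1, 2, 3, 4], uniform)
pair_skewed = lower_upper_weighted_median([1, 2, 3, 4], skewed)
```

With four equally weighted values 1 to 4, the candidates are 2 and 3; `resolve` returns their mean, 2.5, when new values are allowed and 2 otherwise.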

Properties

The sum of weights in each of the two partitions should be as equal as possible.

If the weights of all numbers in the set are equal, then the weighted median reduces to the median.
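A short derivation, offered as a sketch under the definition's assumption that the weights sum to 1, shows why equal weights recover the ordinary median. Setting $w_{i}=1/n$ for every $i$, the two conditions of the general case become

$$\frac{k-1}{n}\leq \frac{1}{2}\quad {\text{and}}\quad \frac{n-k}{n}\leq \frac{1}{2},$$

which together are equivalent to

$$\frac{n}{2}\leq k\leq \frac{n}{2}+1.$$

For odd $n$ the only integer in this range is $k=(n+1)/2$, the position of the median. For even $n$ both $k=n/2$ and $k=n/2+1$ qualify; these are the lower and upper weighted medians, and their mean is the usual median of a set with an even number of elements.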

Examples

For simplicity, consider the set of numbers $\{1,2,3,4,5\}$ with respective weights $\{0.15,0.1,0.2,0.3,0.25\}$. The median is 3 and the weighted median is the element corresponding to the weight 0.3, which is 4. The weights on each side of the pivot add up to 0.45 and 0.25, satisfying the general condition that each side be as even as possible. Choosing any other element as the pivot would result in a greater difference between the two sides.
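The claim that every other pivot gives a larger difference can be checked by tabulating the partition sums. The snippet below is a small sketch using the numbers from this example; the variable names are illustrative, and exact fractions are used to avoid rounding in the comparisons.

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5]
weights = [Fraction(15, 100), Fraction(10, 100), Fraction(20, 100),
           Fraction(30, 100), Fraction(25, 100)]

# For each candidate pivot, total the weights strictly below and strictly
# above it, then record how far apart the two totals are.
imbalance = {}
for k, x in enumerate(values):
    below = sum(weights[:k], Fraction(0))
    above = sum(weights[k + 1:], Fraction(0))
    imbalance[x] = abs(below - above)

best = min(imbalance, key=imbalance.get)
```

The differences are 0.85, 0.6, 0.3, 0.2 and 0.75 for the pivots 1 to 5, so the pivot 4 gives the most balanced split.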

Consider the set of numbers $\{1,2,3,4\}$ with uniform weights $\{0.25,0.25,0.25,0.25\}$. Equal weights should result in a weighted median equal to the median. This median is 2.5 since the set has an even number of elements. The lower weighted median is 2 with partition sums of 0.25 and 0.5, and the upper weighted median is 3 with partition sums of 0.5 and 0.25. These partitions each satisfy their respective special condition and the general condition. It is ideal to introduce a new pivot by taking the mean of the upper and lower weighted medians when they exist. With this, the set of numbers is $\{1,2,2.5,3,4\}$ with respective weights $\{0.25,0.25,0,0.25,0.25\}$. This creates partitions that both sum to 0.5. It can easily be seen that the weighted median and median are the same for a set of any size with equal weights.

Similarly, consider the set of numbers $\{1,2,3,4\}$ with respective weights $\{0.49,0.01,0.25,0.25\}$. The lower weighted median is 2 with partition sums of 0.49 and 0.5, and the upper weighted median is 3 with partition sums of 0.5 and 0.25. When working with integers or non-interval measures, the lower weighted median would be accepted since it has the lower weight of the pair and therefore keeps the partitions most equal. However, it is preferable to take the mean of these weighted medians when that makes sense. Coincidentally, both the weighted median and median are equal to 2.5, but this will not always hold true for larger sets, depending on the weight distribution.

Algorithm

The weighted median can be computed by sorting the set of numbers and finding the shortest prefix of the sorted numbers whose weights sum to at least half of the total weight. This algorithm takes $O(n\log n)$ time. There is a better approach to find the weighted median using a modified selection algorithm.^[1]
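Before the selection-based pseudocode that follows, the sorting approach can be sketched as below. This is a minimal illustration rather than a reference implementation: the function name `weighted_median_sorted` is hypothetical, the weights are assumed to be positive, and they are compared against half of their own total so they need not be normalized. When two elements qualify, this sketch returns the lower weighted median.

```python
def weighted_median_sorted(values, weights):
    """Lower weighted median via sorting, in O(n log n) time."""
    pairs = sorted(zip(values, weights))  # order the elements by value
    half = sum(weights) / 2
    running = 0
    for x, w in pairs:
        running += w
        # The first element whose cumulative weight reaches half the total
        # has less than half below it and at most half above it.
        if running >= half:
            return x
    raise ValueError("weighted median of empty input")

# Integer weights proportional to {0.15, 0.1, 0.2, 0.3, 0.25}, given unsorted.
result = weighted_median_sorted([5, 1, 4, 2, 3], [25, 15, 30, 10, 20])
```

Integer weights are used in the usage line so that the comparison with half of the total is exact.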

// Main call is WeightedMedian(a, 1, n)
// Returns lower median
WeightedMedian(a[1..n], p, r)
    // Base case for single element
    if r = p then
        return a[p]
    // Base case for two elements
    // Make sure we return the mean in the case that the two candidates have equal weight
    if r - p = 1 then
        if a[p].w == a[r].w
            return (a[p] + a[r]) / 2
        if a[p].w > a[r].w
            return a[p]
        else
            return a[r]
    // Partition around pivot r
    q = partition(a, p, r)
    wl, wg = sum weights of partitions (p, q-1), (q+1, r)
    // If partitions are balanced then we are done
    if wl and wg both < 1/2 then
        return a[q]
    else
        // Increase pivot weight by the amount of partition we eliminate
        if wl > wg then
            a[q].w += wg
            // Recurse on pivot inclusively
            WeightedMedian(a, p, q)
        else
            a[q].w += wl
            WeightedMedian(a, q, r)
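A runnable variant of the selection idea is sketched below. It is not a line-by-line translation of the pseudocode: instead of folding the eliminated weight into the pivot, it keeps a running total of the weight already discarded below the remaining elements, and it picks pivots at random, which gives expected (not worst-case) linear time. The function name `weighted_median_select` is hypothetical, weights are assumed positive, and the lower weighted median is returned when two elements qualify.

```python
import random

def weighted_median_select(values, weights):
    """Lower weighted median by quickselect-style partitioning."""
    items = list(zip(values, weights))
    if not items:
        raise ValueError("weighted median of empty input")
    half = sum(weights) / 2
    carried_below = 0  # weight of discarded elements smaller than all remaining ones
    while True:
        pivot = random.choice(items)[0]
        smaller = [(x, w) for x, w in items if x < pivot]
        larger = [(x, w) for x, w in items if x > pivot]
        w_below = carried_below + sum(w for _, w in smaller)
        w_pivot = sum(w for x, w in items if x == pivot)
        if w_below >= half:
            # At least half the weight lies below the pivot: look there.
            items = smaller
        elif w_below + w_pivot >= half:
            # Less than half below, at most half above: the pivot qualifies.
            return pivot
        else:
            # The pivot and everything below it are too light: discard them
            # and remember their weight.
            carried_below = w_below + w_pivot
            items = larger

result = weighted_median_select([5, 1, 4, 2, 3], [25, 15, 30, 10, 20])
```

Each pass discards at least the pivot, so the loop terminates; a deterministic pivot rule such as median of medians would be needed for a worst-case linear bound.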

Software/source code

A fast weighted median algorithm is implemented in a C extension for Python in the Robustats Python package.
R has many implementations, including matrixStats::weightedMedian(), spatstat::weighted.median(), and others.^[7]

References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). Introduction to Algorithms. MIT Press. ISBN 9780262032933.
[2] Horowitz, Ellis; Sahni, Sartaj; Rajasekaran, Sanguthevar (1996-12-15). Computer Algorithms C++: C++ and Pseudocode Versions. Macmillan. ISBN 9780716783152.
[3] Bovik, Alan C (2010-07-21). Handbook of Image and Video Processing. Academic Press. ISBN 9780080533612.
[4] Edgeworth, F. Y. (1888). "On a New Method of Reducing Observations Relating to Several Quantities". Philosophical Magazine. 25 (154): 184–191. doi:10.1080/14786448808628170.
[5] Edgeworth, F. Y. (1887). "On Observations Relating to Several Quantities". Hermathena. 6 (13). Trinity College Dublin: 279–285. JSTOR 23036355.
[6] Lange, Kenneth (15 June 2010). Numerical Analysis for Statisticians (second ed.). Springer. p. 313. ISBN 978-1-4419-5944-7.
[7] Is there a weighted.median() function?

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.