# Weighted median

In statistics, a weighted median of a sample is the 50% weighted percentile. [1] [2] [3] It was first proposed by F. Y. Edgeworth in 1888. [4] [5] Like the median, it is useful as an estimator of central tendency that is robust against outliers. It allows for non-uniform statistical weights, for example to reflect measurements of varying precision in the sample.

## Definition

### General case

For ${\displaystyle n}$ distinct ordered elements ${\displaystyle x_{1},x_{2},...,x_{n}}$ with positive weights ${\displaystyle w_{1},w_{2},...,w_{n}}$ such that ${\displaystyle \sum _{i=1}^{n}w_{i}=1}$, the weighted median is the element ${\displaystyle x_{k}}$ satisfying

${\displaystyle \sum _{i=1}^{k-1}w_{i}\leq 1/2}$ and ${\displaystyle \sum _{i=k+1}^{n}w_{i}\leq 1/2}$
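The two inequalities can be checked directly for every candidate position. The following is a minimal Python sketch, not a reference implementation: the function name `general_case_indices` is hypothetical, the weights are assumed to be listed in the sorted order of their elements, and `fractions.Fraction` is used only so that comparisons with 1/2 are exact.

```python
from fractions import Fraction

def general_case_indices(weights):
    """Return the 0-based positions k that satisfy both inequalities.

    Assumes `weights` follows the sorted order of the elements and sums to 1.
    """
    half = Fraction(1, 2)
    total = sum(weights)
    before = Fraction(0)          # sum of weights to the left of position k
    hits = []
    for k, w in enumerate(weights):
        after = total - before - w   # sum of weights to the right of position k
        if before <= half and after <= half:
            hits.append(k)
        before += w
    return hits

xs = [1, 2, 3, 4, 5]
ws = [Fraction(s) for s in ("0.15", "0.1", "0.2", "0.3", "0.25")]
medians = [xs[k] for k in general_case_indices(ws)]
print(medians)
```

When exactly one position is returned, that element is the weighted median; two positions correspond to a lower and an upper weighted median.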

### Special case

Consider a set of elements in which two of the elements satisfy the general case. This occurs when the weights of both elements border the midpoint of the total weight without straddling it; instead, each element defines a partition whose weight is exactly ${\displaystyle 1/2}$. These elements are referred to as the lower weighted median and the upper weighted median. They satisfy the following conditions:

Lower Weighted Median

${\displaystyle \sum _{i=1}^{k-1}w_{i}<1/2}$ and ${\displaystyle \sum _{i=k+1}^{n}w_{i}=1/2}$

Upper Weighted Median

${\displaystyle \sum _{i=1}^{k-1}w_{i}=1/2}$ and ${\displaystyle \sum _{i=k+1}^{n}w_{i}<1/2}$

Ideally, a new element would be created using the mean of the upper and lower weighted medians and assigned a weight of zero. This method is similar to finding the median of a set with an even number of elements. The new element would be a true median, since the sums of the weights on either side of this partition point would be equal.

Depending on the application, it may not be possible or wise to create new data. In this case, the weighted median should be chosen based on which element keeps the partitions most equal. This is always the candidate with the lower weight.

In the event that the upper and lower weighted medians have equal weights, the lower weighted median is generally accepted, as originally proposed by Edgeworth. [6]
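The choice between the two candidates can be sketched in Python as follows. This is an illustration under stated assumptions rather than a standard API: the names `lower_upper_weighted_median`, `resolve_pair` and `allow_new` are hypothetical, weights are assumed positive, and `Fraction` weights are used so that a partition sum of exactly 1/2 is detected reliably.

```python
from fractions import Fraction

def lower_upper_weighted_median(xs, ws):
    """Return the lower and upper weighted medians as (value, weight) pairs.

    In the general case both pairs are the same element.
    """
    pairs = sorted(zip(xs, ws))
    total = sum(ws)
    half = total / 2
    before = 0                      # weight strictly to the left of the current element
    candidates = []
    for x, w in pairs:
        after = total - before - w  # weight strictly to the right
        if before <= half and after <= half:
            candidates.append((x, w))
        before += w
    return candidates[0], candidates[-1]

def resolve_pair(lower, upper, allow_new=True):
    """Pick a single weighted median from a (lower, upper) pair."""
    (xl, wl), (xu, wu) = lower, upper
    if xl == xu:
        return xl                   # general case: only one candidate
    if allow_new:
        return (xl + xu) / 2        # new zero-weight element at the mean
    # Otherwise keep the candidate with the lower weight; ties go to the lower median.
    return xl if wl <= wu else xu

xs = [1, 2, 3, 4]
ws = [Fraction(s) for s in ("0.49", "0.01", "0.25", "0.25")]
lower, upper = lower_upper_weighted_median(xs, ws)
print(resolve_pair(lower, upper))                   # mean of the two candidates
print(resolve_pair(lower, upper, allow_new=False))  # candidate with the lower weight
```

Setting `allow_new=False` corresponds to applications, such as integer or non-interval data, where averaging two observations is not meaningful.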

## Properties

The sum of weights in each of the two partitions should be as equal as possible.

If the weights of all numbers in the set are equal, then the weighted median reduces to the median.

## Examples

For simplicity, consider the set of numbers ${\displaystyle \{1,2,3,4,5\}}$ with the weights ${\displaystyle \{0.15,0.1,0.2,0.3,0.25\}}$ respectively. The median is 3, and the weighted median is the element corresponding to the weight 0.3, which is 4. The weights on the two sides of this pivot add up to 0.45 and 0.25, so neither side exceeds 0.5 and the general condition is satisfied. Choosing any other element as the pivot would result in a greater difference between the two sides.

Consider the set of numbers ${\displaystyle \{1,2,3,4\}}$ with the uniform weights ${\displaystyle \{0.25,0.25,0.25,0.25\}}$ respectively. Equal weights should result in a weighted median equal to the median. This median is 2.5, since the set has an even number of elements. The lower weighted median is 2 with partition sums of 0.25 and 0.5, and the upper weighted median is 3 with partition sums of 0.5 and 0.25. These partitions each satisfy their respective special condition as well as the general condition. When both a lower and an upper weighted median exist, it is ideal to introduce a new pivot by taking their mean. With this, the set of numbers is ${\displaystyle \{1,2,2.5,3,4\}}$ with the weights ${\displaystyle \{0.25,0.25,0,0.25,0.25\}}$ respectively. This creates partitions that both sum to 0.5. The weighted median and the median coincide in this way for a set of any size with equal weights.

Similarly, consider the set of numbers ${\displaystyle \{1,2,3,4\}}$ with the weights ${\displaystyle \{0.49,0.01,0.25,0.25\}}$ respectively. The lower weighted median is 2 with partition sums of 0.49 and 0.5, and the upper weighted median is 3 with partition sums of 0.5 and 0.25. When working with integers or non-interval measures, the lower weighted median would be accepted, since it has the lower weight of the pair and therefore keeps the partitions most equal. Where it makes sense, however, taking the mean of the two weighted medians is preferable. Coincidentally, both the weighted median and the median are then equal to 2.5, but this does not hold in general for larger sets; it depends on the weight distribution.

## Algorithm

The weighted median can be computed by sorting the set of numbers and accumulating their weights in sorted order until the running sum reaches half of the total weight. This algorithm takes ${\displaystyle O(n\log n)}$ time. A faster approach finds the weighted median using a modified selection algorithm. [1]
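The sorting-based method can be sketched in a few lines of Python. This is a minimal illustration that assumes positive weights; the function name `weighted_median_sorted` is hypothetical, and when two elements qualify it returns the lower weighted median.

```python
def weighted_median_sorted(xs, ws):
    """Lower weighted median via sorting: O(n log n) time."""
    half = sum(ws) / 2
    running = 0
    for x, w in sorted(zip(xs, ws)):   # visit elements in increasing order
        running += w                   # total weight of elements <= x
        if running >= half:
            return x
    raise ValueError("weighted median of an empty sample is undefined")

print(weighted_median_sorted([1, 2, 3, 4, 5], [0.15, 0.1, 0.2, 0.3, 0.25]))
```

With floating-point weights, a running sum that should equal exactly half of the total may be off by rounding error, so exact arithmetic (for example `fractions.Fraction`) or a tolerance may be needed when ties matter.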

```
// Main call is WeightedMedian(a, 1, n)
// Returns lower median
WeightedMedian(a[1..n], p, r)
    // Base case for single element
    if r = p then
        return a[p]
    // Base case for two elements
    // Make sure we return the average, in case the two candidates have equal weight
    if r - p = 1 then
        if a[p].w == a[r].w
            return (a[p] + a[r]) / 2
        if a[p].w > a[r].w
            return a[p]
        else
            return a[r]
    // Partition around pivot r
    q = partition(a, p, r)
    wl, wg = sum weights of partitions (p, q-1), (q+1, r)
    // If partitions are balanced then we are done
    if wl and wg both < 1/2 then
        return a[q]
    else
        // Increase pivot weight by the amount of partition we eliminate
        if wl > wg then
            a[q].w += wg
            // Recurse on pivot inclusively
            WeightedMedian(a, p, q)
        else
            a[q].w += wl
            WeightedMedian(a, q, r)
```
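A runnable variant of the selection idea is sketched below in Python. It is not a transcription of the pseudocode above: instead of folding the eliminated weight into the pivot, it keeps a running total of the weight discarded on the left, and it picks pivots at random, which gives expected (not worst-case) linear time. The name `weighted_median_select` is hypothetical, weights are assumed positive, and the lower weighted median is returned when two elements qualify.

```python
import random

def weighted_median_select(xs, ws):
    """Lower weighted median by randomized selection (expected linear time)."""
    items = list(zip(xs, ws))
    if not items:
        raise ValueError("weighted median of an empty sample is undefined")
    half = sum(ws) / 2
    carried_left = 0   # weight of discarded elements smaller than all remaining ones
    while True:
        pivot = random.choice(items)[0]
        smaller = [(x, w) for x, w in items if x < pivot]
        larger = [(x, w) for x, w in items if x > pivot]
        w_equal = sum(w for x, w in items if x == pivot)
        w_before = carried_left + sum(w for _, w in smaller)
        if w_before >= half:
            # Half of the weight is already reached left of the pivot.
            items = smaller
        elif w_before + w_equal >= half:
            # The pivot is the first element at which half of the weight is reached.
            return pivot
        else:
            # Discard the pivot and everything smaller, remembering their weight.
            carried_left = w_before + w_equal
            items = larger

print(weighted_median_select([1, 2, 3, 4, 5], [0.15, 0.1, 0.2, 0.3, 0.25]))
```

Replacing the random pivot with a deterministic linear-time median selection would bound the worst case as well, at the cost of a more involved implementation.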

## Software/source code

• A fast weighted median algorithm is implemented in a C extension for Python in the Robustats Python package.
• R has many implementations, including matrixStats::weightedMedian(), spatstat::weighted.median(), and others. [7]

## References

1. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). Introduction to Algorithms. ISBN 9780262032933.
2. Horowitz, Ellis; Sahni, Sartaj; Rajasekaran, Sanguthevar (1996-12-15). Computer Algorithms C++: C++ and Pseudocode Versions. ISBN 9780716783152.
3. Bovik, Alan C. (2010-07-21). Handbook of Image and Video Processing. ISBN 9780080533612.
4. Edgeworth, F. Y. (1888). "On a New Method of Reducing Observations Relating to Several Quantities". Philosophical Magazine. 25 (154): 184–191. doi:10.1080/14786448808628170.
5. Edgeworth, F. Y. (1887). "On Observations Relating to Several Quantities". Hermathena. Trinity College Dublin. 6 (13): 279–285. JSTOR 23036355.
6. Lange, Kenneth (15 June 2010). Numerical Analysis for Statisticians (second ed.). Springer. p. 313. ISBN 978-1-4419-5944-7.