Median of medians

[Image: Median Of Medians - Color.png — one iteration of the algorithm on a randomized set of 100 elements]

Class: Selection algorithm
Data structure: Array
Worst-case performance: O(n)
Best-case performance: O(n)
Worst-case space complexity: O(1) auxiliary
Optimal: Yes

In computer science, the median of medians is an approximate median selection algorithm, frequently used to supply a good pivot for an exact selection algorithm, most commonly quickselect, which selects the kth smallest element of an initially unsorted array. Median of medians finds an approximate median in linear time. Using this approximate median as an improved pivot, the worst-case complexity of quickselect reduces from quadratic to linear, which is also the asymptotically optimal worst-case complexity of any selection algorithm. In other words, the median of medians is an approximate median-selection algorithm that helps build an asymptotically optimal, exact general selection algorithm (especially in the sense of worst-case complexity), by producing good pivot elements.


Median of medians can also be used as a pivot strategy in quicksort, yielding an optimal sorting algorithm with worst-case complexity O(n log n).[citation needed] Although this approach optimizes the asymptotic worst-case complexity quite well, it is typically outperformed in practice by instead choosing random pivots, which gives O(n) average complexity for selection and O(n log n) average complexity for sorting, without any overhead of computing the pivot.

Similarly, median of medians is used in the hybrid introselect algorithm as a fallback for pivot selection at each iteration until the kth smallest element is found. This again ensures worst-case linear performance, in addition to average-case linear performance: introselect starts with quickselect (by default with a random pivot) to obtain good average performance, and then falls back to modified quickselect with a pivot obtained from median of medians if the progress is too slow. Even though asymptotically similar, such a hybrid algorithm will have a lower complexity than a straightforward introselect up to a constant factor (both in average-case and worst-case), at any finite length.

The algorithm was published in Blum et al. (1973), and thus is sometimes called BFPRT after the last names of the authors. In the original paper the algorithm was referred to as PICK, referring to quickselect as "FIND".

Motivation

Quickselect is linear-time on average, but it can require quadratic time with poor pivot choices. This is because quickselect is a divide and conquer algorithm, with each step taking time linear in the size of the remaining search set. If the search set decreases exponentially quickly in size (by a fixed proportion), this yields a geometric series times the linear factor of a single step, and thus linear overall time. However, if the search set decreases slowly in size, such as linearly (by a fixed number of elements, in the worst case only reducing by one element each time), then a linear sum of linear steps yields quadratic overall time (formally, triangular numbers grow quadratically). For example, the worst case occurs when pivoting on the smallest element at each step, such as applying quickselect for the maximum element to already sorted data and taking the first element as pivot each time.

If one instead consistently chooses "good" pivots, this is avoided and one always gets linear performance even in the worst case. A "good" pivot is one for which we can establish that a constant proportion of elements fall both below and above it, as then the search set decreases at least by a constant proportion at each step, hence exponentially quickly, and the overall time remains linear. The median is a good pivot – the best for sorting, and the best overall choice for selection – decreasing the search set by half at each step. Thus if one can compute the median in linear time, this only adds linear time to each step, and thus the overall complexity of the algorithm remains linear.

The median-of-medians algorithm computes an approximate median, namely a point that is guaranteed to be between the 30th and 70th percentiles (in the middle 4 deciles). Thus the search set decreases by at least 30%. The problem is reduced to 70% of the original size, which is a fixed proportion smaller. Applying the same algorithm on the now smaller set recursively until only one or two elements remain results in a cost of O(n).
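To make the geometric-series argument concrete, the following small Python sketch sums the work when the search set shrinks to 70% at every step and each step costs (at most) the current set size. It deliberately ignores the cost of computing the pivot itself, which the proof section below accounts for; the function name total_work is introduced here only for illustration.

def total_work(n):
    # Sum the step costs n + 0.7*n + 0.7^2*n + ... until the set is exhausted.
    work, size = 0, n
    while size > 1:
        work += size
        size = int(0.7 * size)
    return work

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(total_work(n) / n, 2))
# The ratio stays near 1/(1 - 0.7) = 10/3, i.e. the total work is linear in n.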

Algorithm

As stated before, median-of-medians is used as a pivot selection strategy in the quickselect algorithm, which in pseudocode looks as follows. A runnable Python transcription of all of these routines is sketched after the partition helper functions below.

function nthSmallest(list, n)
    index := select(list, 1, length of list, n)
    // Use select(list, 0, length of list - 1, n - 1) if the list uses zero-based numbering
    return list[index]

Be careful to handle left, right and n when implementing. The following pseudocode assumes that left, right, and the list use one-based numbering and that select is initially called with 1 as the argument to left and the length of the list as the argument to right. Note that select returns the index of the nth smallest number after rearranging the list, rather than the actual value of the nth smallest number.

function select(list, left, right, n)
    loop
        if left = right then
            return left
        pivotIndex := pivot(list, left, right)
        pivotIndex := partition(list, left, right, pivotIndex, n)
        if n = pivotIndex then
            return n
        else if n < pivotIndex then
            right := pivotIndex - 1
        else
            left := pivotIndex + 1

The subroutine pivot is the actual median-of-medians algorithm. It divides its input (a list of length n) into groups of at most five elements, computes the median of each of those groups using some subroutine, then recursively computes the true median of the medians found in the previous step. [1] Note that pivot calls select; this is an instance of mutual recursion.

function pivot(list, left, right)
    // for 5 or less elements just get median
    if right − left < 5 then
        return partition5(list, left, right)
    // otherwise move the medians of five-element subgroups to the first n/5 positions
    for i from left to right in steps of 5
        // get the median position of the i'th five-element subgroup
        subRight := i + 4
        if subRight > right then
            subRight := right
        median5 := partition5(list, i, subRight)
        swap list[median5] and list[left + floor((i − left) / 5)]

    // compute the median of the n/5 medians-of-five
    mid := floor((right − left) / 10) + left + 1
    return select(list, left, left + floor((right − left) / 5), mid)

Partition helper functions

There is a subroutine called partition that can, in linear time, group a list (ranging from indices left to right) into three parts, those less than a certain element, those equal to it, and those greater than the element (a three-way partition). The grouping into three parts ensures that the median-of-medians maintains linear execution time in a case of many or all coincident elements. Here is pseudocode that performs a partition about the element list[pivotIndex]:

function partition(list, left, right, pivotIndex, n)
    pivotValue := list[pivotIndex]
    swap list[pivotIndex] and list[right]  // Move pivot to end
    storeIndex := left
    // Move all elements smaller than the pivot to the left of the pivot
    for i from left to right − 1 do
        if list[i] < pivotValue then
            swap list[storeIndex] and list[i]
            increment storeIndex
    // Move all elements equal to the pivot right after the smaller elements
    storeIndexEq := storeIndex
    for i from storeIndex to right − 1 do
        if list[i] = pivotValue then
            swap list[storeIndexEq] and list[i]
            increment storeIndexEq
    swap list[right] and list[storeIndexEq]  // Move pivot to its final place
    // Return location of pivot considering the desired location n
    if n < storeIndex then
        return storeIndex  // n is in the group of smaller elements
    if n ≤ storeIndexEq then
        return n  // n is in the group equal to pivot
    return storeIndexEq  // n is in the group of larger elements

The partition5 subroutine selects the median of a group of at most five elements; an easy way to implement this is insertion sort, as shown below. [1] It can also be implemented as a decision tree.

function partition5(list, left, right)
    i := left + 1
    while i ≤ right do
        j := i
        while j > left and list[j−1] > list[j] do
            swap list[j−1] and list[j]
            j := j − 1
        i := i + 1

    return left + floor((right − left) / 2)
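The routines above translate almost line for line into Python. The following is a minimal, illustrative sketch rather than a tuned implementation: it uses zero-based indices with left and right inclusive, rearranges the list in place, and adds a convenience wrapper nth_smallest (a name introduced here) that counts n from 0.

# select returns the final (sorted) index of the n-th smallest element of lst[left..right].
def select(lst, left, right, n):
    while True:
        if left == right:
            return left
        pivot_index = pivot(lst, left, right)
        pivot_index = partition(lst, left, right, pivot_index, n)
        if n == pivot_index:
            return n
        if n < pivot_index:
            right = pivot_index - 1
        else:
            left = pivot_index + 1

# Median of medians: return the index of an approximate median of lst[left..right].
def pivot(lst, left, right):
    # For 5 or fewer elements, just take the median directly.
    if right - left < 5:
        return partition5(lst, left, right)
    # Move the median of each five-element subgroup to the front of the range.
    for i in range(left, right + 1, 5):
        sub_right = min(i + 4, right)
        median5 = partition5(lst, i, sub_right)
        dest = left + (i - left) // 5
        lst[median5], lst[dest] = lst[dest], lst[median5]
    # Recursively compute the median of those medians (mutual recursion with select).
    mid = (right - left) // 10 + left + 1
    return select(lst, left, left + (right - left) // 5, mid)

# Three-way partition of lst[left..right] around lst[pivot_index]; returns an
# index that is already in its final sorted position, chosen relative to n.
def partition(lst, left, right, pivot_index, n):
    pivot_value = lst[pivot_index]
    lst[pivot_index], lst[right] = lst[right], lst[pivot_index]  # move pivot to end
    store = left
    for i in range(left, right):          # smaller elements to the left
        if lst[i] < pivot_value:
            lst[store], lst[i] = lst[i], lst[store]
            store += 1
    store_eq = store
    for i in range(store, right):         # equal elements right after them
        if lst[i] == pivot_value:
            lst[store_eq], lst[i] = lst[i], lst[store_eq]
            store_eq += 1
    lst[right], lst[store_eq] = lst[store_eq], lst[right]        # pivot to final place
    if n < store:
        return store       # n lies among the smaller elements
    if n <= store_eq:
        return n           # n lies among the elements equal to the pivot
    return store_eq        # n lies among the larger elements

# Insertion-sort lst[left..right] (at most five elements), return the median's index.
def partition5(lst, left, right):
    for i in range(left + 1, right + 1):
        j = i
        while j > left and lst[j - 1] > lst[j]:
            lst[j - 1], lst[j] = lst[j], lst[j - 1]
            j -= 1
    return left + (right - left) // 2

# Convenience wrapper: value of the n-th smallest element, with n counted from 0.
def nth_smallest(lst, n):
    return lst[select(lst, 0, len(lst) - 1, n)]

# Example: the median of a shuffled 0..100 is 50.
import random
data = list(range(101))
random.shuffle(data)
print(nth_smallest(data, 50))   # prints 50

As in the pseudocode, select and pivot are mutually recursive, and partition always reports an index whose element is already in its final sorted position.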

Properties of pivot

Of the n/5 groups, half (about n/10 of them) have their median less than the pivot (the median of medians), and another half have their median greater than the pivot. In each of the n/10 groups with median less than the pivot, there are two further elements that are smaller than their respective medians, which are in turn smaller than the pivot. Thus, each of these n/10 groups has at least 3 elements that are smaller than the pivot. Similarly, in each of the n/10 groups with median greater than the pivot, there are two elements that are greater than their respective medians, which are greater than the pivot, so each of these groups has at least 3 elements that are greater than the pivot. Hence, the pivot is less than at least 3n/10 of the elements and greater than at least another 3n/10 of the elements. Thus the chosen median splits the ordered elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm. To visualize:

One iteration on a randomized set of 100 elements from 0 to 99
          12 15 11  2  9  5  0  7  3 21 44 40  1 18 20 32 19 35 37 39
          13 16 14  8 10 26  6 33  4 27 49 46 52 25 51 34 43 56 72 79
Medians   17 23 24 28 29 30 31 36 42 47 50 55 58 60 63 65 66 67 81 83
          22 45 38 53 61 41 62 82 54 48 59 57 71 78 64 80 70 76 85 87
          96 95 94 86 89 69 68 97 73 92 74 88 99 84 75 90 77 93 98 91

(In the original color figure: red = "(one of the two possible) median of medians", gray = "number < red", white = "number > red".)

5-tuples are shown here sorted by median, for clarity. Sorting the tuples is not necessary because we only need the median for use as pivot element.

Note that all elements above/left of the red (30% of the 100 elements) are less, and all elements below/right of the red (another 30% of the 100 elements) are greater.
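This guarantee can also be checked empirically. The following sketch assumes the Python routines from the Algorithm section above are in scope; it runs pivot on random permutations and reports the fraction of elements strictly smaller than the pivot it returns (pivot_rank_fraction is a name introduced here for illustration).

import random

def pivot_rank_fraction(n):
    # Fraction of elements strictly smaller than the chosen pivot,
    # for a random permutation of n distinct values.
    data = random.sample(range(n), n)
    value = data[pivot(data, 0, n - 1)]   # pivot() from the sketch above
    return sum(1 for x in data if x < value) / n

fractions = [pivot_rank_fraction(100) for _ in range(1000)]
print(min(fractions), max(fractions))
# Every observed fraction falls roughly between 0.3 and 0.7, as the
# counting argument above guarantees.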

Proof of linear running time

The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians has size n/5, while the other recursive call recurses on at most 70% of the list. Let T(n) be the time it takes to run a median-of-medians quickselect algorithm on an array of size n. Then we know this time is:

    T(n) ≤ T(n/5) + T(7n/10) + c·n

where

    the T(n/5) term accounts for finding the true median of the n/5 group medians, by running the selection algorithm on them;
    the T(7n/10) term accounts for the recursive call on the remaining search set, which in the worst case has size at most 7n/10 (the pivot is guaranteed to discard at least 3n/10 elements); and
    the c·n term accounts for the partitioning work and the medians of five, each element being visited a constant number of times.

By induction, assuming T(m) ≤ 10·c·m for all smaller m:

    T(n) ≤ (n/5)·10·c + (7n/10)·10·c + c·n = 2·c·n + 7·c·n + c·n = 10·c·n,

so T(n) = O(n).
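The behaviour of this recurrence can also be seen numerically. The following sketch iterates it with c = 1, integer arguments rounded up, and a base case of T(n) = n for n ≤ 5; all of these are illustrative assumptions.

from functools import lru_cache
from math import ceil

@lru_cache(maxsize=None)
def T(n):
    # T(n) = T(n/5) + T(7n/10) + c*n with c = 1 and arguments rounded up.
    if n <= 5:
        return n
    return T(ceil(n / 5)) + T(ceil(7 * n / 10)) + n

for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(n, round(T(n) / n, 2))
# The ratio T(n)/n levels off at a constant instead of growing with n,
# consistent with the linear bound obtained by induction.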

Analysis

The key step is reducing the problem to selecting in two lists whose total length is shorter than the original list, plus a linear factor for the reduction step. This allows a simple induction to show that the overall running time is linear.

The specific choice of groups of five elements is explained as follows. Firstly, computing the median of an odd-length list is faster and simpler; while one could use an even-length list, this requires taking the average of the two middle elements, which is slower than simply selecting the single exact middle element. Secondly, five is the smallest odd number such that median of medians works. With groups of only three elements, the resulting list of medians to search in has length n/3, and the list to recurse into can have length up to 2n/3, since the pivot is only guaranteed to be greater than 1/2 × 2/3 = 1/3 of the elements and less than another 1/3 of the elements. Thus this still leaves n/3 + 2n/3 = n elements to search in across the two recursive calls, not reducing the problem sufficiently. The individual lists are shorter, however, and one can bound the resulting complexity to O(n log n) by the Akra–Bazzi method, but this does not prove linearity.

Conversely, one may instead group by g = seven, nine, or more elements, and this does work. This reduces the size of the list of medians to n/g, while the size of the list to recurse into asymptotically approaches 3n/4 (75%), as the quadrants in the above table approach 25% and the overlapping rows and columns shrink proportionally. This reduces the scaling constant from 10 asymptotically toward 4, but accordingly raises the c·n term for the partitioning work. Finding the median of a larger group takes longer, but this is a constant factor (depending only on g), and thus does not change the overall performance as n grows. The exact constant factor, in terms of the worst-case number of comparisons, depends on the group size.

If one instead groups the other way, say dividing the n-element list into 5 lists of n/5 elements each, computing the median of each, and then computing the median of these – i.e., grouping by a constant fraction, not a constant number – one does not as clearly reduce the problem, since it requires computing 5 medians, each in a list of n/5 elements, and then recursing on a list of length at most 7n/10. As with grouping by 3, the individual lists are shorter, but the overall length is no shorter – in fact longer – and thus one can only prove superlinear bounds. Grouping into a square of √n lists of length √n is similarly complicated.
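The arithmetic behind the choice of group size can be sketched directly. The fractions below follow the counting argument above, generalized (as an assumption of this sketch) to an odd group size g: the two recursive calls act on about n/g elements (the medians) and at most (3g − 1)/(4g)·n elements (the larger side of the partition), and linearity requires the two fractions to sum to less than 1.

# Total fraction of the input passed to the two recursive calls,
# as a function of the (odd) group size g.
for g in (3, 5, 7, 9, 11):
    medians_part = 1 / g                      # the list of group medians
    partition_part = (3 * g - 1) / (4 * g)    # the larger side of the partition
    print(g, round(medians_part + partition_part, 3))
# g = 3 gives exactly 1.0 (no net reduction), g = 5 gives 0.9, and the
# partition side approaches 3/4 as g grows, as described above.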

Related Research Articles

<span class="mw-page-title-main">Analysis of algorithms</span> Study of resources used by an algorithm

In computer science, the analysis of algorithms is the process of finding the computational complexity of algorithms—the amount of time, storage, or other resources needed to execute them. Usually, this involves determining a function that relates the size of an algorithm's input to the number of steps it takes or the number of storage locations it uses. An algorithm is said to be efficient when this function's values are small, or grow slowly compared to a growth in the size of the input. Different inputs of the same size may cause the algorithm to have different behavior, so best, worst and average case descriptions might all be of practical interest. When not otherwise specified, the function describing the performance of an algorithm is usually an upper bound, determined from the worst case inputs to the algorithm.

<span class="mw-page-title-main">Binary search</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Gaussian elimination</span> Algorithm for solving systems of linear equations

In mathematics, Gaussian elimination, also known as row reduction, is an algorithm for solving systems of linear equations. It consists of a sequence of row-wise operations performed on the corresponding matrix of coefficients. This method can also be used to compute the rank of a matrix, the determinant of a square matrix, and the inverse of an invertible matrix. The method is named after Carl Friedrich Gauss (1777–1855). To perform row reduction on a matrix, one uses a sequence of elementary row operations to modify the matrix until the lower left-hand corner of the matrix is filled with zeros, as much as possible. There are three types of elementary row operations:

<span class="mw-page-title-main">Merge sort</span> Divide and conquer sorting algorithm

In computer science, merge sort is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the relative order of equal elements is the same in the input and output. Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948.

<span class="mw-page-title-main">Sorting algorithm</span> Algorithm that arranges lists in order

In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is important for optimizing the efficiency of other algorithms that require input data to be in sorted lists. Sorting is also often useful for canonicalizing data and for producing human-readable output.

<span class="mw-page-title-main">Binary heap</span> Variant of heap data structure

A binary heap is a heap data structure that takes the form of a binary tree. Binary heaps are a common way of implementing priority queues. The binary heap was introduced by J. W. J. Williams in 1964 as a data structure for implementing heapsort.

<span class="mw-page-title-main">Shellsort</span> Sorting algorithm which uses multiple comparison intervals

Shellsort, also known as Shell sort or Shell's method, is an in-place comparison sort. It can be seen as either a generalization of sorting by exchange or sorting by insertion. The method starts by sorting pairs of elements far apart from each other, then progressively reducing the gap between elements to be compared. By starting with far-apart elements, it can move some out-of-place elements into the position faster than a simple nearest-neighbor exchange. Donald Shell published the first version of this sort in 1959. The running time of Shellsort is heavily dependent on the gap sequence it uses. For many practical variants, determining their time complexity remains an open problem.

<span class="mw-page-title-main">Bucket sort</span> Sorting algorithm

Bucket sort, or bin sort, is a sorting algorithm that works by distributing the elements of an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. It is a distribution sort, a generalization of pigeonhole sort that allows multiple keys per bucket, and is a cousin of radix sort in the most-to-least significant digit flavor. Bucket sort can be implemented with comparisons and therefore can also be considered a comparison sort algorithm. The computational complexity depends on the algorithm used to sort each bucket, the number of buckets to use, and whether the input is uniformly distributed.

Introsort or introspective sort is a hybrid sorting algorithm that provides both fast average performance and (asymptotically) optimal worst-case performance. It begins with quicksort, it switches to heapsort when the recursion depth exceeds a level based on (the logarithm of) the number of elements being sorted and it switches to insertion sort when the number of elements is below some threshold. This combines the good parts of the three algorithms, with practical performance comparable to quicksort on typical data sets and worst-case O(n log n) runtime due to the heap sort. Since the three algorithms it uses are comparison sorts, it is also a comparison sort.

<span class="mw-page-title-main">Time complexity</span> Estimate of time taken for running an algorithm

In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, supposing that each elementary operation takes a fixed amount of time to perform. Thus, the amount of time taken and the number of elementary operations performed by the algorithm are taken to be related by a constant factor.

In computer science, a selection algorithm is an algorithm for finding the th smallest value in a collection of ordered values, such as numbers. The value that it finds is called the th order statistic. Selection includes as special cases the problems of finding the minimum, median, and maximum element in the collection. Selection algorithms include quickselect, and the median of medians algorithm. When applied to a collection of values, these algorithms take linear time, as expressed using big O notation. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted array takes time .

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets. It provides operations for adding new sets, merging sets, and finding a representative member of a set. The last operation makes it possible to find out efficiently if any two elements are in the same or different sets.

<i>k</i>-d tree Multidimensional search tree for points in k dimensional space

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. K-dimensional is that which concerns exactly k orthogonal axes or a space of any number of dimensions. k-d trees are a useful data structure for several applications, such as:

<span class="mw-page-title-main">Quickselect</span> Algorithm for the kth smallest element in an array

In computer science, quickselect is a selection algorithm to find the kth smallest element in an unordered list, also known as the kth order statistic. Like the related quicksort sorting algorithm, it was developed by Tony Hoare, and thus is also known as Hoare's selection algorithm. Like quicksort, it is efficient in practice and has good average-case performance, but has poor worst-case performance. Quickselect and its variants are the selection algorithms most often used in efficient real-world implementations.

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing p−1 < s elements from the result. These elements then divide the array into p approximately equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.

In computer science, the range query problem consists of efficiently answering several queries regarding a given interval of elements within an array. For example, a common task, known as range minimum query, is finding the smallest value inside a given range within a list of numbers.

In computer science, partial sorting is a relaxed variant of the sorting problem. Total sorting is the problem of returning a list of items such that its elements all appear in order, while partial sorting is returning a list of the k smallest elements in order. The other elements may also be sorted, as in an in-place partial sort, or may be discarded, which is common in streaming partial sorts. A common practical example of partial sorting is computing the "Top 100" of some list.

<span class="mw-page-title-main">Medcouple</span> Statistics term

In statistics, the medcouple is a robust statistic that measures the skewness of a univariate distribution. It is defined as a scaled median difference between the left and right half of a distribution. Its robustness makes it suitable for identifying outliers in adjusted boxplots. Ordinary box plots do not fare well with skew distributions, since they label the longer unsymmetrical tails as outliers. Using the medcouple, the whiskers of a boxplot can be adjusted for skew distributions and thus have a more accurate identification of outliers for non-symmetrical distributions.

<span class="mw-page-title-main">Parallel external memory</span>

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

References

1. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. p. 220. ISBN 0-262-03384-4.