Partial sorting

Last updated

In computer science, partial sorting is a relaxed variant of the sorting problem. Total sorting is the problem of returning a list of items such that its elements all appear in order, while partial sorting is returning a list of the k smallest (or k largest) elements in order. The other elements (above the k smallest ones) may also be sorted, as in an in-place partial sort, or may be discarded, which is common in streaming partial sorts. A common practical example of partial sorting is computing the "Top 100" of some list.

Contents

In terms of indices, in a partially sorted list, for every index i from 1 to k, the i-th element is in the same place as it would be in the fully sorted list: element i of the partially sorted list contains order statistic i of the input list.

Offline problems

Heap-based solution

Heaps admit a simple single-pass partial sort when k is fixed: insert the first k elements of the input into a max-heap. Then make one pass over the remaining elements, add each to the heap in turn, and remove the largest element. Each insertion operation takes O(log k) time, resulting in O(n log k) time overall; this "partial heapsort" algorithm is practical for small values of k and in online settings. [1] An "online heapselect" algorithm described below, based on a min-heap, takes O(n + k log n). [1]

Solution by partitioning selection

A further relaxation requiring only a list of the k smallest elements, but without requiring that these be ordered, makes the problem equivalent to partition-based selection; the original partial sorting problem can be solved by such a selection algorithm to obtain an array where the first k elements are the k smallest, and sorting these, at a total cost of O(n + k log k) operations. A popular choice to implement this algorithm scheme is to combine quickselect and quicksort; the result is sometimes called "quickselsort". [1]

Common in current (as of 2022) C++ STL implementations is a pass of heapselect for a list of k elements, followed by a heapsort for the final result. [2]

Specialised sorting algorithms

More efficient than the aforementioned are specialized partial sorting algorithms based on mergesort and quicksort. In the quicksort variant, there is no need to recursively sort partitions which only contain elements that would fall after the k'th place in the final sorted array (starting from the "left" boundary). Thus, if the pivot falls in position k or later, we recurse only on the left partition: [3]

function partial_quicksort(A, i, j, k) isif i < j then         p ← pivot(A, i, j)         p ← partition(A, i, j, p)         partial_quicksort(A, i, p-1, k)         if p < k-1 then             partial_quicksort(A, p+1, j, k)

The resulting algorithm is called partial quicksort and requires an expected time of only O(n + k log k), and is quite efficient in practice, especially if a selection sort is used as a base case when k becomes small relative to n. However, the worst-case time complexity is still very bad, in the case of a bad pivot selection. Pivot selection along the lines of the worst-case linear time selection algorithm (see Quicksort § Choice of pivot) could be used to get better worst-case performance. Partial quicksort, quickselect (including the multiple variant), and quicksort can all be generalized into what is known as a chunksort. [1]

Incremental sorting

Incremental sorting is a version of the partial sorting problem where the input is given up front but k is unknown: given a k-sorted array, it should be possible to extend the partially sorted part so that the array becomes (k+1)-sorted. [4]

Heaps lead to an O(n + k log n) "online heapselect" solution to incremental partial sorting: first "heapify", in linear time, the complete input array to produce a min-heap. Then extract the minimum of the heap k times. [1]

A different incremental sort can be obtained by modifying quickselect. The version due to Paredes and Navarro maintains a stack of pivots across calls, so that incremental sorting can be accomplished by repeatedly requesting the smallest item of an array A from the following algorithm: [4]

Algorithm IQS(A : array, i : integer, S : stack) returns the i'th smallest element in A

  • If i = top(S):
    • Pop S
    • Return A[i]
  • Let pivot ← random [i, top(S))
  • Update pivot ← partition(A[i : top(S)), A[pivot])
  • Push pivot onto S
  • Return IQS(A, i, S)

The stack S is initialized to contain only the length n of A. k-sorting the array is done by calling IQS(A, i, S) for i = 0, 1, 2, ...; this sequence of calls has average-case complexity O(n + k log k), which is asymptotically equivalent to O(n + k log n). The worst-case time is quadratic, but this can be fixed by replacing the random pivot selection by the median of medians algorithm. [4]

Language/library support

See also

Related Research Articles

<span class="mw-page-title-main">Heapsort</span> A sorting algorithm which uses the heap data structure

In computer science, heapsort is a comparison-based sorting algorithm. Heapsort can be thought of as an improved selection sort: like selection sort, heapsort divides its input into a sorted and an unsorted region, and it iteratively shrinks the unsorted region by extracting the largest element from it and inserting it into the sorted region. Unlike selection sort, heapsort does not waste time with a linear-time scan of the unsorted region; rather, heap sort maintains the unsorted region in a heap data structure to more quickly find the largest element in each step.

<span class="mw-page-title-main">Heap (data structure)</span> Computer science data structure

In computer science, a heap is a specialized tree-based data structure which is essentially an almost complete tree that satisfies the heap property: in a max heap, for any given node C, if P is a parent node of C, then the key of P is greater than or equal to the key of C. In a min heap, the key of P is less than or equal to the key of C. The node at the "top" of the heap is called the root node.

<span class="mw-page-title-main">Insertion sort</span> Sorting algorithm

Insertion sort is a simple sorting algorithm that builds the final sorted array (or list) one item at a time by comparisons. It is much less efficient on large lists than more advanced algorithms such as quicksort, heapsort, or merge sort. However, insertion sort provides several advantages:

<span class="mw-page-title-main">Merge sort</span> Divide and conquer-based sorting algorithm

In computer science, merge sort is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the order of equal elements is the same in the input and output. Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948.

Merge algorithms are a family of algorithms that take multiple sorted lists as input and produce a single list as output, containing all the elements of the inputs lists in sorted order. These algorithms are used as subroutines in various sorting algorithms, most famously merge sort.

<span class="mw-page-title-main">Sorting algorithm</span> Algorithm that arranges lists in order

In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is important for optimizing the efficiency of other algorithms that require input data to be in sorted lists. Sorting is also often useful for canonicalizing data and for producing human-readable output.

<span class="mw-page-title-main">Bucket sort</span> Sorting algorithm

Bucket sort, or bin sort, is a sorting algorithm that works by distributing the elements of an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. It is a distribution sort, a generalization of pigeonhole sort that allows multiple keys per bucket, and is a cousin of radix sort in the most-to-least significant digit flavor. Bucket sort can be implemented with comparisons and therefore can also be considered a comparison sort algorithm. The computational complexity depends on the algorithm used to sort each bucket, the number of buckets to use, and whether the input is uniformly distributed.

Introsort or introspective sort is a hybrid sorting algorithm that provides both fast average performance and (asymptotically) optimal worst-case performance. It begins with quicksort, it switches to heapsort when the recursion depth exceeds a level based on (the logarithm of) the number of elements being sorted and it switches to insertion sort when the number of elements is below some threshold. This combines the good parts of the three algorithms, with practical performance comparable to quicksort on typical data sets and worst-case O(n log n) runtime due to the heap sort. Since the three algorithms it uses are comparison sorts, it is also a comparison sort.

In computer science, a selection algorithm is an algorithm for finding the kth smallest number in a list or array; such a number is called the kth order statistic. This includes the cases of finding the minimum, maximum, and median elements. There are O(n)-time (worst-case linear time) selection algorithms, and sublinear performance is possible for structured data; in the extreme, O(1) for an array of sorted data. Selection is a subproblem of more complex problems like the nearest neighbor and shortest path problems. Many selection algorithms are derived by generalizing a sorting algorithm, and conversely some sorting algorithms can be derived as repeated application of selection.

<span class="mw-page-title-main">Quickselect</span> Algorithm for the kth smallest element in an array

In computer science, quickselect is a selection algorithm to find the kth smallest element in an unordered list, also known as the kth order statistic. Like the related quicksort sorting algorithm, it was developed by Tony Hoare, and thus is also known as Hoare's selection algorithm. Like quicksort, it is efficient in practice and has good average-case performance, but has poor worst-case performance. Quickselect and its variants are the selection algorithms most often used in efficient real-world implementations.

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

sort is a generic function in the C++ Standard Library for doing comparison sorting. The function originated in the Standard Template Library (STL).

<span class="mw-page-title-main">Tree sort</span> Type of Sorting Algorithm

A tree sort is a sort algorithm that builds a binary search tree from the elements to be sorted, and then traverses the tree (in-order) so that the elements come out in sorted order. Its typical use is sorting elements online: after each insertion, the set of elements seen so far is available in sorted order.

Spreadsort is a sorting algorithm invented by Steven J. Ross in 2002. It combines concepts from distribution-based sorts, such as radix sort and bucket sort, with partitioning concepts from comparison sorts such as quicksort and mergesort. In experimental results it was shown to be highly efficient, often outperforming traditional algorithms such as quicksort, particularly on distributions exhibiting structure and string sorting. There is an open-source implementation with performance analysis and benchmarks, and HTML documentation .

In computer science, introselect is a selection algorithm that is a hybrid of quickselect and median of medians which has fast average performance and optimal worst-case performance. Introselect is related to the introsort sorting algorithm: these are analogous refinements of the basic quickselect and quicksort algorithms, in that they both start with the quick algorithm, which has good average performance and low overhead, but fall back to an optimal worst-case algorithm if the quick algorithm does not progress rapidly enough. Both algorithms were introduced by David Musser in, with the purpose of providing generic algorithms for the C++ Standard Library that have both fast average performance and optimal worst-case performance, thus allowing the performance requirements to be tightened.

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing p−1 < s elements from the result. These elements then divide the array into p approximately equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.

In computer science, the median of medians is an approximate median selection algorithm, frequently used to supply a good pivot for an exact selection algorithm, most commonly quickselect, that selects the kth smallest element of an initially unsorted array. Median of medians finds an approximate median in linear time. Using this approximate median as an improved pivot, the worst-case complexity of quickselect reduces from quadratic to linear, which is also the asymptotically optimal worst-case complexity of any selection algorithm. In other words, the median of medians is an approximate median-selection algorithm that helps building an asymptotically optimal, exact general selection algorithm, by producing good pivot elements.

In computer science, the Floyd-Rivest algorithm is a selection algorithm developed by Robert W. Floyd and Ronald L. Rivest that has an optimal expected number of comparisons within lower-order terms. It is functionally equivalent to quickselect, but runs faster in practice on average. It has an expected running time of O(n) and an expected number of comparisons of n + min(k, nk) + O(n1/2 log1/2n).

In computer science, k-way merge algorithms or multiway merges are a specific type of sequence merge algorithms that specialize in taking in k sorted lists and merging them into a single sorted list. These merge algorithms generally refer to merge algorithms that take in a number of sorted lists greater than two. Two-way merges are also referred to as binary merges.

Proportion extend sort is an in-place, comparison-based sorting algorithm which attempts to improve on the performance, particularly the worst-case performance, of quicksort.

References

  1. 1 2 3 4 5 Conrado Martínez (2004). On partial sorting (PDF). 10th Seminar on the Analysis of Algorithms.
  2. "std::partial_sort". en.cppreference.com.
  3. Martínez, Conrado (2004). Partial quicksort (PDF). Proc. 6th ACM-SIAM Workshop on Algorithm Engineering and Experiments and 1st ACM-SIAM Workshop on Analytic Algorithmics and Combinatorics.
  4. 1 2 3 Paredes, Rodrigo; Navarro, Gonzalo (2006). "Optimal Incremental Sorting". Proc. Eighth Workshop on Algorithm Engineering and Experiments (ALENEX). pp. 171–182. CiteSeerX   10.1.1.218.4119 . doi:10.1137/1.9781611972863.16. ISBN   978-1-61197-286-3.