Comparison sort

Last updated
Sorting a set of unlabelled weights by weight using only a balance scale requires a comparison sort algorithm. Balance a tabac 1850.JPG
Sorting a set of unlabelled weights by weight using only a balance scale requires a comparison sort algorithm.

A comparison sort is a type of sorting algorithm that only reads the list elements through a single abstract comparison operation (often a "less than or equal to" operator or a three-way comparison) that determines which of two elements should occur first in the final sorted list. The only requirement is that the operator forms a total preorder over the data, with:

Contents

  1. if ab and bc then ac (transitivity)
  2. for all a and b, ab or ba (connexity).

It is possible that both ab and ba; in this case either may come first in the sorted list. In a stable sort, the input order determines the sorted order in this case.

Comparison sorts studied in the literature are "comparison-based". [1] Elements a and b can be swapped or otherwise re-arranged by the algorithm only when the order between these elements has been established based on the outcomes of prior comparisons. This is the case when the order between a and b can be derived via the transitive closure of these prior comparison outcomes.

For comparison-based sorts the decision to execute basic operations other than comparisons is based on the outcome of comparisons. Hence in a time analysis the number of executed comparisons is used to determine upper bound estimates for the number of executed basic operations such as swaps or assignments. [1]

A metaphor for thinking about comparison sorts is that someone has a set of unlabelled weights and a balance scale. Their goal is to line up the weights in order by their weight without any information except that obtained by placing two weights on the scale and seeing which one is heavier (or if they weigh the same).

Examples

Quicksort in action on a list of numbers. The horizontal lines are pivot values. Sorting quicksort anim.gif
Quicksort in action on a list of numbers. The horizontal lines are pivot values.

Some of the most well-known comparison sorts include:

Performance limits and advantages of different sorting techniques

There are fundamental limits on the performance of comparison sorts. A comparison sort must have an average-case lower bound of Ω(n log n) comparison operations, [2] which is known as linearithmic time. This is a consequence of the limited information available through comparisons alone — or, to put it differently, of the vague algebraic structure of totally ordered sets. In this sense, mergesort, heapsort, and introsort are asymptotically optimal in terms of the number of comparisons they must perform, although this metric neglects other operations. Non-comparison sorts (such as the examples discussed below) can achieve O(n) performance by using operations other than comparisons, allowing them to sidestep this lower bound (assuming elements are constant-sized).

Comparison sorts may run faster on some lists; many adaptive sorts such as insertion sort run in O(n) time on an already-sorted or nearly-sorted list. The Ω(n log n) lower bound applies only to the case in which the input list can be in any possible order.

Real-world measures of sorting speed may need to take into account the ability of some algorithms to optimally use relatively fast cached computer memory, or the application may benefit from sorting methods where sorted data begins to appear to the user quickly (and then user's speed of reading will be the limiting factor) as opposed to sorting methods where no output is available until the whole list is sorted.

Despite these limitations, comparison sorts offer the notable practical advantage that control over the comparison function allows sorting of many different datatypes and fine control over how the list is sorted. For example, reversing the result of the comparison function allows the list to be sorted in reverse; and one can sort a list of tuples in lexicographic order by just creating a comparison function that compares each part in sequence:

function tupleCompare((lefta, leftb, leftc), (righta, rightb, rightc))     if lefta ≠ righta         return compare(lefta, righta)     else if leftb ≠ rightb         return compare(leftb, rightb)     elsereturn compare(leftc, rightc)

Comparison sorts generally adapt more easily to complex orders such as the order of floating-point numbers. Additionally, once a comparison function is written, any comparison sort can be used without modification; non-comparison sorts typically require specialized versions for each datatype.

This flexibility, together with the efficiency of the above comparison sorting algorithms on modern computers, has led to widespread preference for comparison sorts in most practical work.

Alternatives

Some sorting problems admit a strictly faster solution than the Ω(n log n) bound for comparison sorting by using non-comparison sorts; an example is integer sorting, where all keys are integers. When the keys form a small (compared to n) range, counting sort is an example algorithm that runs in linear time. Other integer sorting algorithms, such as radix sort, are not asymptotically faster than comparison sorting, but can be faster in practice.

The problem of sorting pairs of numbers by their sum is not subject to the Ω(n² log n) bound either (the square resulting from the pairing up); the best known algorithm still takes O(n² log n) time, but only O(n²) comparisons.

Number of comparisons required to sort a list

nMinimum
100
211
333
455
577
61010
71313
81616
91919
102222
112626
122930 [3] [4]
133334 [5] [6] [7]
143738 [7]
154142 [8] [9] [10]
164546 [11]
174950 [11]
185354 [11]
195758 [10]
206262
216666
227071 [7]
n
102219
100525521
1 0008 5308 524
10 000118 459118 451
100 0001 516 7051 516 695
1 000 00018 488 88518 488 874
Above: A comparison of the lower bound to the actual minimum number of comparisons (from OEIS:  A036604 ) required to sort a list of n items (for the worst case). Below: Using Stirling's approximation, this lower bound is well-approximated by .

The number of comparisons that a comparison sort algorithm requires increases in proportion to , where is the number of elements to sort. This bound is asymptotically tight.

Given a list of distinct numbers (we can assume this because this is a worst-case analysis), there are factorial permutations exactly one of which is the list in sorted order. The sort algorithm must gain enough information from the comparisons to identify the correct permutation. If the algorithm always completes after at most steps, it cannot distinguish more than cases because the keys are distinct and each comparison has only two possible outcomes. Therefore,

, or equivalently

By looking at the first factors of , we obtain

This provides the lower-bound part of the claim. A more precise bound can be given via Stirling's approximation. An upper bound of the same form, with the same leading term as the bound obtained from Stirling's approximation, follows from the existence of the algorithms that attain this bound in the worst case, like merge sort.

The above argument provides an absolute, rather than only asymptotic lower bound on the number of comparisons, namely comparisons. This lower bound is fairly good (it can be approached within a linear tolerance by a simple merge sort), but it is known to be inexact. For example, , but the minimal number of comparisons to sort 13 elements has been proved to be 34.

Determining the exact number of comparisons needed to sort a given number of entries is a computationally hard problem even for small n, and no simple formula for the solution is known. For some of the few concrete values that have been computed, see OEIS:  A036604 .

Lower bound for the average number of comparisons

A similar bound applies to the average number of comparisons. Assuming that

it is impossible to determine which order the input is in with fewer than log2(n!) comparisons on average.

This can be most easily seen using concepts from information theory. The Shannon entropy of such a random permutation is log2(n!) bits. Since a comparison can give only two results, the maximum amount of information it provides is 1 bit. Therefore, after k comparisons the remaining entropy of the permutation, given the results of those comparisons, is at least log2(n!)  k bits on average. To perform the sort, complete information is needed, so the remaining entropy must be 0. It follows that k must be at least log2(n!) on average.

The lower bound derived via information theory is phrased as 'information-theoretic lower bound'. Information-theoretic lower bound is correct but is not necessarily the strongest lower bound. And in some cases, the information-theoretic lower bound of a problem may even be far from the true lower bound. For example, the information-theoretic lower bound of selection is whereas comparisons are needed by an adversarial argument. The interplay between information-theoretic lower bound and the true lower bound are much like a real-valued function lower-bounding an integer function. However, this is not exactly correct when the average case is considered.

To unearth what happens while analyzing the average case, the key is that what does 'average' refer to? Averaging over what? With some knowledge of information theory, the information-theoretic lower bound averages over the set of all permutations as a whole. But any computer algorithms (under what are believed currently) must treat each permutation as an individual instance of the problem. Hence, the average lower bound we're searching for is averaged over all individual cases.

To search for the lower bound relating to the non-achievability of computers, we adopt the Decision tree model. Let's rephrase a bit of what our objective is. In the Decision tree model, the lower bound to be shown is the lower bound of the average length of root-to-leaf paths of an -leaf binary tree (in which each leaf corresponds to a permutation). The minimum average length of a binary tree with a given number of leaves is achieved by a balanced full binary tree, because any other binary tree can have its path length reduced by moving a pair of leaves to a higher position. With some careful calculations, for a balanced full binary tree with leaves, the average length of root-to-leaf paths is given by

For example, for n = 3, the information-theoretic lower bound for the average case is approximately 2.58, while the average lower bound derived via Decision tree model is 8/3, approximately 2.67.

In the case that multiple items may have the same key, there is no obvious statistical interpretation for the term "average case", so an argument like the above cannot be applied without making specific assumptions about the distribution of keys.

n log n maximum number of comparisons for array-size in format 2^k

Can easy compute for real algorithm sorted-list-merging (array are sorted n-blocks with size 1, merge to 1–1 to 2, merge 2–2 to 4...).

(1) = = = = = = = =  (2) =   =   =   =     // max 1 compares (size1+size2-1), 4x repeats to concat 8 arrays with size 1 and 1    === === === ===  (3)   =       =       // max 7 compares, 2x repeats to concat 4 arrays with size 2 and 2      ===     ===       =====   =====     ======= =======  (4)                   // max 15 compares, 1x repeats to concat 2 arrays with size 4 and 4  Formula extraction: n = 256 = 2^8 (array size in format 2^k, for simplify) On = (n-1) + 2(n/2-1) + 4(n/4-1) + 8(n/8-1) + 16(n/16-1) + 32(n/32-1) + 64(n/64-1) + 128(n/128-1) On = (n-1) + (n-2) + (n-4) + (n-8) + (n-16) + (n-32) + (n-64) + (n-128) On = n+n+n+n+n+n+n+n - (1+2+4+8+16+32+64+128)   | 1+2+4... = formula for geometric sequence Sn = a1 * (q^i - 1) / (n - 1), n is number of items, a1 is first item On = 8*n - 1 * (2^8 - 1) / (2 - 1) On = 8*n - (2^8 - 1)   | 2^8 = n On = 8*n - (n - 1) On = (8-1)*n + 1   | 8 = ln(n)/ln(2) = ln(256)/ln(2) On = (ln(n)/ln(2) - 1) * n + 1  Example: n = 2^4 = 16, On ~= 3*n n = 2^8 = 256, On ~= 7*n n = 2^10 = 1.024, On ~= 9*n n = 2^20 = 1.048.576, On ~= 19*n 

Sorting a pre-sorted list

If a list is already close to sorted, according to some measure of sortedness, the number of comparisons required to sort it can be smaller. An adaptive sort takes advantage of this "presortedness" and runs more quickly on nearly-sorted inputs, often while still maintaining an worst case time bound. An example is adaptive heap sort, a sorting algorithm based on Cartesian trees. It takes time , where is the average, over all values in the sequence, of the number of times the sequence jumps from below to above or vice versa. [12]

Notes

[13]

Notes

  1. 1 2 Wilkes, M. V. (1974-11-01). "The Art of Computer Programming, Volume 3, Sorting and Searching". The Computer Journal. 17 (4): 324–324. doi:10.1093/comjnl/17.4.324. ISSN   0010-4620.
  2. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. pp. 191–193. ISBN   0-262-03384-4.
  3. Mark Wells, Applications of a language for computing in combinatorics, Information Processing 65 (Proceedings of the 1965 IFIP Congress), 497–498, 1966.
  4. Mark Wells, Elements of Combinatorial Computing, Pergamon Press, Oxford, 1971.
  5. Takumi Kasai, Shusaku Sawato, Shigeki Iwata, Thirty four comparisons are required to sort 13 items, LNCS 792, 260-269, 1994.
  6. Marcin Peczarski, Sorting 13 elements requires 34 comparisons, LNCS 2461, 785–794, 2002.
  7. 1 2 3 Marcin Peczarski, New results in minimum-comparison sorting, Algorithmica 40 (2), 133–145, 2004.
  8. Marcin Peczarski, Computer assisted research of posets, PhD thesis, University of Warsaw, 2006.
  9. Peczarski, Marcin (2007). "The Ford-Johnson algorithm is still unbeaten for less than 47 elements". Inf. Process. Lett. 101 (3): 126–128. doi:10.1016/j.ipl.2006.09.001.
  10. 1 2 Cheng, Weiyi; Liu, Xiaoguang; Wang, Gang; Liu, Jing (October 2007). "最少比较排序问题中S(15)和S(19)的解决" [The results of S(15) and S(19) to minimum-comparison sorting problem]. Journal of Frontiers of Computer Science and Technology (in Chinese). 1 (3): 305–313.
  11. 1 2 3 Stober, F., & Weiß, A. (2023). Lower Bounds for Sorting 16, 17, and 18 Elements. In 2023 Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX) (pp. 201-213). Society for Industrial and Applied Mathematics
  12. Levcopoulos, Christos; Petersson, Ola (1989), "Heapsort - Adapted for Presorted Files", WADS '89: Proceedings of the Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, vol. 382, London, UK: Springer-Verlag, pp. 499–509, doi:10.1007/3-540-51542-9_41 .
  13. Knuth, Donald E. (1998). The art of computer programming, volume 3: (2nd ed.) sorting and searching. USA: Addison Wesley Longman Publishing Co., Inc. ISBN   978-0-201-89685-5.

Related Research Articles

<span class="mw-page-title-main">Binary search algorithm</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

In computer science, the computational complexity or simply complexity of an algorithm is the amount of resources required to run it. Particular focus is given to computation time and memory storage requirements. The complexity of a problem is the complexity of the best algorithms that allow solving the problem.

<span class="mw-page-title-main">Merge sort</span> Divide and conquer-based sorting algorithm

In computer science, merge sort is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the relative order of equal elements is the same in the input and output. Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948.

<span class="mw-page-title-main">Floor and ceiling functions</span> Nearest integers from a number

In mathematics and computer science, the floor function is the function that takes as input a real number x, and gives as output the greatest integer less than or equal to x, denoted x or floor(x). Similarly, the ceiling function maps x to the smallest integer greater than or equal to x, denoted x or ceil(x).

<span class="mw-page-title-main">Shellsort</span> Sorting algorithm which uses multiple comparison intervals

Shellsort, also known as Shell sort or Shell's method, is an in-place comparison sort. It can be seen as either a generalization of sorting by exchange or sorting by insertion. The method starts by sorting pairs of elements far apart from each other, then progressively reducing the gap between elements to be compared. By starting with far apart elements, it can move some out-of-place elements into position faster than a simple nearest neighbor exchange. Donald Shell published the first version of this sort in 1959. The running time of Shellsort is heavily dependent on the gap sequence it uses. For many practical variants, determining their time complexity remains an open problem.

In computer science, the Akra–Bazzi method, or Akra–Bazzi theorem, is used to analyze the asymptotic behavior of the mathematical recurrences that appear in the analysis of divide and conquer algorithms where the sub-problems have substantially different sizes. It is a generalization of the master theorem for divide-and-conquer recurrences, which assumes that the sub-problems have equal size. It is named after mathematicians Mohamad Akra and Louay Bazzi.

<span class="mw-page-title-main">Binary logarithm</span> Exponent of a power of two

In mathematics, the binary logarithm is the power to which the number 2 must be raised to obtain the value n. That is, for any real number x,

<span class="mw-page-title-main">Self-balancing binary search tree</span> Any node-based binary search tree that automatically keeps its height the same

In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height small in the face of arbitrary item insertions and deletions. These operations when designed for a self-balancing binary search tree, contain precautionary measures against boundlessly increasing tree height, so that these abstract data structures receive the attribute "self-balancing".

In computer science, a selection algorithm is an algorithm for finding the th smallest value in a collection of ordered values, such as numbers. The value that it finds is called the th order statistic. Selection includes as special cases the problems of finding the minimum, median, and maximum element in the collection. Selection algorithms include quickselect, and the median of medians algorithm. When applied to a collection of values, these algorithms take linear time, as expressed using big O notation. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted array takes time .

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

In computer science, an algorithm is said to be asymptotically optimal if, roughly speaking, for large inputs it performs at worst a constant factor worse than the best possible algorithm. It is a term commonly encountered in computer science research as a result of widespread use of big-O notation.

<span class="mw-page-title-main">Karger's algorithm</span> Randomized algorithm for minimum cuts

In computer science and graph theory, Karger's algorithm is a randomized algorithm to compute a minimum cut of a connected graph. It was invented by David Karger and first published in 1993.

Discrepancy of hypergraphs is an area of discrepancy theory.

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing p−1 < s elements from the result. These elements then divide the array into p approximately equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.

In computational complexity the decision tree model is the model of computation in which an algorithm is considered to be basically a decision tree, i.e., a sequence of queries or tests that are done adaptively, so the outcome of previous tests can influence the tests performed next.

In computer science, an exponential search is an algorithm, created by Jon Bentley and Andrew Chi-Chih Yao in 1976, for searching sorted, unbounded/infinite lists. There are numerous ways to implement this, with the most common being to determine a range that the search key resides in and performing a binary search within that range. This takes O(log i) where i is the position of the search key in the list, if the search key is in the list, or the position where the search key should be, if the search key is not in the list.

In mathematics, the incompressibility method is a proof method like the probabilistic method, the counting method or the pigeonhole principle. To prove that an object in a certain class satisfies a certain property, select an object of that class that is incompressible. If it does not satisfy the property, it can be compressed by computable coding. Since it can be generally proven that almost all objects in a given class are incompressible, the argument demonstrates that almost all objects in the class have the property involved. To select an incompressible object is ineffective, and cannot be done by a computer program. However, a simple counting argument usually shows that almost all objects of a given class can be compressed by only a few bits.

In mathematics and computer science, the sorting numbers are a sequence of numbers introduced in 1950 by Hugo Steinhaus for the analysis of comparison sort algorithms. These numbers give the worst-case number of comparisons used by both binary insertion sort and merge sort. However, there are other algorithms that use fewer comparisons.

In computer science, merge-insertion sort or the Ford–Johnson algorithm is a comparison sorting algorithm published in 1959 by L. R. Ford Jr. and Selmer M. Johnson. It uses fewer comparisons in the worst case than the best previously known algorithms, binary insertion sort and merge sort, and for 20 years it was the sorting algorithm with the fewest known comparisons. Although not of practical significance, it remains of theoretical interest in connection with the problem of sorting with a minimum number of comparisons. The same algorithm may have also been independently discovered by Stanisław Trybuła and Czen Ping.

<span class="mw-page-title-main">Parallel external memory</span>

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

References