Flashsort

Last updated August 26, 2023

Flashsort is a distribution sorting algorithm that runs in linear time, $O(n)$, on uniformly distributed data sets and requires relatively little additional memory. The original work was published in 1998 by Karl-Dietrich Neubert.^[1]

Concept

Flashsort is an efficient in-place implementation of histogram sort, itself a type of bucket sort. It assigns each of the $n$ input elements to one of $m$ buckets, efficiently rearranges the input to place the buckets in the correct order, then sorts each bucket. The original algorithm sorts an input array $A$ as follows:

1. Using a first pass over the input or a priori knowledge, find the minimum and maximum sort keys.
2. Linearly divide the range $[A_{\min}, A_{\max}]$ into $m$ buckets.
3. Make one pass over the input, counting the number of elements $A_i$ which fall into each bucket. (Neubert calls the buckets "classes" and the assignment of elements to their buckets "classification".)
4. Convert the counts of elements in each bucket to a prefix sum, where $L_b$ is the number of elements $A_i$ in bucket $b$ or less. ($L_0 = 0$ and $L_m = n$.)
5. Rearrange the input so all elements of each bucket $b$ are stored in positions $A_i$ where $L_{b-1} < i \leq L_b$.
6. Sort each bucket using insertion sort.
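Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration, not Neubert's code: the function name `classify`, the 1-based bucket numbering, and the exact interpolation formula are assumptions chosen to match the notation used here, and the input is assumed to be a non-empty list of numbers.

```python
def classify(a, m):
    """Return (bucket_of, L) for a non-empty list a and m linear buckets.

    bucket_of(x) maps a key to a bucket number in 1..m (hypothetical helper).
    L[b] is the number of elements in buckets 1..b, so L[0] == 0, L[m] == len(a).
    """
    lo, hi = min(a), max(a)                  # step 1: minimum and maximum keys

    def bucket_of(x):                        # step 2: linear division of [lo, hi]
        if hi == lo:                         # all keys equal: everything in bucket 1
            return 1
        return 1 + min(m - 1, int(m * (x - lo) / (hi - lo)))

    L = [0] * (m + 1)
    for x in a:                              # step 3: count elements per bucket
        L[bucket_of(x)] += 1
    for b in range(1, m + 1):                # step 4: prefix sums of the counts
        L[b] += L[b - 1]
    return bucket_of, L


bucket_of, L = classify([5, 1, 9, 3, 7], 2)
print(L)  # [0, 2, 5]
```

The maximum key would otherwise fall just outside the last bucket, which is why the bucket number is clamped with `min(m - 1, ...)`.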

Steps 1–3 and 6 are common to any bucket sort, and can be improved using techniques generic to bucket sorts. In particular, the goal is for the buckets to be of approximately equal size ($n/m$ elements each),^[1] with the ideal being division into $m$ quantiles. While the basic algorithm is a linear interpolation sort, if the input distribution is known to be non-uniform, a non-linear division will more closely approximate this ideal. Likewise, the final sort can use any of a number of techniques, including a recursive flashsort.

What distinguishes flashsort is step 5: an efficient $O(n)$ in-place algorithm for collecting the elements of each bucket together in the correct relative order using only $m$ words of additional memory.

Memory efficient implementation

The Flashsort rearrangement phase operates in cycles. Elements start out "unclassified", then are moved to the correct bucket and considered "classified". The basic procedure is to choose an unclassified element, find its correct bucket, exchange it with an unclassified element there (which must exist, because we counted the size of each bucket ahead of time), mark it as classified, and then repeat with the just-exchanged unclassified element. Eventually, the element is exchanged with itself and the cycle ends.

The details are easy to understand using two (word-sized) variables per bucket. The clever part is the elimination of one of those variables, allowing twice as many buckets to be used and therefore half as much time spent on the final $O(n^2)$ sorting.

To understand it with two variables per bucket, assume there are two arrays of $m$ additional words: $K_b$ is the (fixed) upper limit of bucket $b$ (and $K_0 = 0$), while $L_b$ is a (movable) index into bucket $b$, so $K_{b-1} \leq L_b \leq K_b$.

We maintain the loop invariant that each bucket is divided by $L_b$ into an unclassified prefix ($A_i$ for $K_{b-1} < i \leq L_b$ have yet to be moved to their target buckets) and a classified suffix ($A_i$ for $L_b < i \leq K_b$ are all in the correct bucket and will not be moved again). Initially $L_b = K_b$ and all elements are unclassified. As sorting proceeds, the $L_b$ are decremented until $L_b = K_{b-1}$ for all $b$ and all elements are classified into the correct bucket.

Each round begins by finding the first incompletely classified bucket $c$ (which has $K_{c-1} < L_c$) and taking the first unclassified element in that bucket, $A_i$ where $i = K_{c-1} + 1$. (Neubert calls this the "cycle leader".) Copy $A_i$ to a temporary variable $t$ and repeat:

- Compute the bucket $b$ to which $t$ belongs.
- Let $j = L_b$ be the location where $t$ will be stored.
- Exchange $t$ with $A_j$, i.e. store $t$ in $A_j$ while fetching the previous value $A_j$ thereby displaced.
- Decrement $L_b$ to reflect the fact that $A_j$ is now correctly classified.
- If $j \neq i$, restart this loop with the new $t$.
- If $j = i$, this round is over and find a new first unclassified element $A_i$.
- When there are no more unclassified elements, the distribution into buckets is complete.
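The two-array procedure above might look like this in Python. It is a sketch under stated assumptions: `permute_two_arrays` and `bucket_of` are hypothetical names, `K` is the prefix-sum array with `K[0] == 0`, and positions `i` and `j` are kept 1-based as in the text, so the list is accessed as `a[j - 1]`.

```python
def permute_two_arrays(a, bucket_of, K):
    """Move every element of a into its bucket region, in place.

    K[b] is the fixed upper limit of bucket b (K[0] == 0); the returned L
    holds the movable indices, which end at L[b] == K[b - 1].
    """
    m = len(K) - 1
    L = K[:]                               # initially L[b] == K[b]: nothing classified
    for c in range(1, m + 1):
        while L[c] > K[c - 1]:             # bucket c still has unclassified elements
            i = K[c - 1] + 1               # cycle leader: first position of bucket c
            t = a[i - 1]
            while True:
                b = bucket_of(t)           # bucket to which t belongs
                j = L[b]                   # location where t will be stored
                a[j - 1], t = t, a[j - 1]  # exchange t with A_j
                L[b] -= 1                  # A_j is now classified
                if j == i:                 # t landed on the leader's slot: round over
                    break
    return L


a = [5, 1, 9, 3, 7]
permute_two_arrays(a, lambda x: 1 if x < 5 else 2, K=[0, 2, 5])
print(a)  # [1, 3, 9, 7, 5]
```

After the call, keys below 5 occupy the first two positions and the rest occupy the last three; each bucket is still unsorted internally.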

When implemented with two variables per bucket in this way, the choice of each round's starting point $i$ is in fact arbitrary; any unclassified element may be used as a cycle leader. The only requirement is that the cycle leaders can be found efficiently.

Although the preceding description uses $K$ to find the cycle leaders, it is in fact possible to do without it, allowing the entire $m$-word array to be eliminated. (After the distribution is complete, the bucket boundaries can be found in $L$.)

Suppose that we have classified all elements up to $i - 1$, and are considering $A_i$ as a potential new cycle leader. It is easy to compute its target bucket $b$. By the loop invariant, it is classified if $L_b < i \leq K_b$, and unclassified if $i$ is outside that range. The first inequality is easy to test, but the second appears to require the value $K_b$.

It turns out that the induction hypothesis that all elements up to $i - 1$ are classified implies that $i \leq K_b$, so it is not necessary to test the second inequality.

Consider the bucket $c$ which position $i$ falls into. That is, $K_{c-1} < i \leq K_c$. By the induction hypothesis, all elements below $i$, which includes all buckets up to $K_{c-1} < i$, are completely classified. I.e. no elements which belong in those buckets remain in the rest of the array. Therefore, it is not possible that $b < c$.

The only remaining case is $b \geq c$, which implies $K_b \geq K_c \geq i$, Q.E.D.

Incorporating this, the flashsort distribution algorithm begins with $L$ as described above and $i = 1$ . Then proceed:^[1]^[2]

1. If $i > n$, the distribution is complete.
2. Given $A_i$, compute the bucket $b$ to which it belongs.
3. If $i \leq L_b$, then $A_i$ is unclassified. Copy it to a temporary variable $t$ and:
- Let $j = L_b$ be the location where $t$ will be stored.
- Exchange $t$ with $A_j$, i.e. store $t$ in $A_j$ while fetching the previous value $A_j$ thereby displaced.
- Decrement $L_b$ to reflect the fact that $A_j$ is now correctly classified.
- If $j \neq i$, compute the bucket $b$ to which $t$ belongs and restart this (inner) loop with the new $t$.
4. $A_i$ is now correctly classified. Increment $i$ and restart the (outer) loop.
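The single-array distribution just described can be sketched in Python as below. The names `flashsort_permute` and `bucket_of` are assumptions for illustration, `L` is the prefix-sum array with `L[0] == 0`, and positions stay 1-based as in the text (hence `a[i - 1]`).

```python
def flashsort_permute(a, bucket_of, L):
    """Distribute a into bucket regions in place, using only the array L.

    L[b] starts as the number of elements in buckets 1..b and is
    decremented as elements are classified; it is modified in place.
    """
    n = len(a)
    i = 1
    while i <= n:                          # step 1: stop once i > n
        b = bucket_of(a[i - 1])            # step 2: bucket of A_i
        if i <= L[b]:                      # step 3: A_i is unclassified
            t = a[i - 1]
            while True:
                j = L[b]                   # location where t will be stored
                a[j - 1], t = t, a[j - 1]  # exchange t with A_j
                L[b] -= 1                  # A_j is now classified
                if j == i:                 # the cycle has closed
                    break
                b = bucket_of(t)           # bucket of the displaced element
        i += 1                             # step 4: A_i is classified


a = [5, 1, 9, 3, 7]
L = [0, 2, 5]
flashsort_permute(a, lambda x: 1 if x < 5 else 2, L)
print(a, L)  # [1, 3, 9, 7, 5] [0, 0, 2]
```

In this run the final `L` shows where each bucket begins: every `L[b]` has been decremented to the upper limit of the bucket before it.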

While saving memory, Flashsort has the disadvantage that it recomputes the bucket for many already-classified elements. This is already done twice per element (once during the bucket-counting phase and a second time when moving each element), but searching for the first unclassified element requires a third computation for most elements. This could be expensive if buckets are assigned using a more complex formula than simple linear interpolation. A variant reduces the number of computations from almost $3 n$ to at most $2 n + m - 1$ by taking the last unclassified element in an unfinished bucket as cycle leader:

1. Maintain a variable $c$ identifying the first incompletely-classified bucket. Let $c = 1$ to begin with, and when $c > m$, the distribution is complete.
2. Let $i = L_c$. If $i = L_{c-1}$, increment $c$ and restart this loop. ($L_0 = 0$.)
3. Compute the bucket $b$ to which $A_i$ belongs.
4. If $b < c$, then $L_c = K_{c-1}$ and we are done with bucket $c$. Increment $c$ and restart this loop.
5. If $b = c$, the classification is trivial. Decrement $L_c$ and restart this loop.
6. If $b > c$, then $A_i$ is unclassified. Perform the same classification loop as the previous case, then restart this loop.

Most elements have their buckets computed only twice, except for the final element in each bucket, which is used to detect the completion of the following bucket. A small further reduction can be achieved by maintaining a count of unclassified elements and stopping when it reaches zero.
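One possible reading of this variant in Python is shown below. It is a sketch only: `permute_last_leader` and `bucket_of` are hypothetical names, `L` is the prefix-sum array with `L[0] == 0`, positions are 1-based as in the text, and the optional count of unclassified elements is left out.

```python
def permute_last_leader(a, bucket_of, L):
    """Distribute a into bucket regions, using the last unclassified
    element of the first unfinished bucket as cycle leader. L is modified."""
    m = len(L) - 1
    c = 1
    while c <= m:                          # c > m: distribution complete
        i = L[c]
        if i == L[c - 1]:                  # nothing left to examine in bucket c
            c += 1
            continue
        b = bucket_of(a[i - 1])
        if b < c:                          # reached the previous bucket: c is done
            c += 1
        elif b == c:                       # already in place: trivial classification
            L[c] -= 1
        else:                              # b > c: A_i is unclassified, run a cycle
            t = a[i - 1]
            while True:
                j = L[b]
                a[j - 1], t = t, a[j - 1]  # exchange t with A_j
                L[b] -= 1
                if j == i:
                    break
                b = bucket_of(t)


a = [5, 1, 9, 3, 7]
permute_last_leader(a, lambda x: 1 if x < 5 else 2, [0, 2, 5])
print(a)  # [3, 1, 9, 7, 5]
```

Here `bucket_of` is called once per element inside a cycle and otherwise only on the element at `L[c]`, which is the source of the saving over the scan through every position.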

Performance

The only extra memory requirements are the auxiliary vector $L$ for storing bucket bounds and the constant number of other variables used. Further, each element is moved (via a temporary buffer, so two move operations) only once. However, this memory efficiency comes with the disadvantage that the array is accessed randomly, so it cannot take advantage of a data cache smaller than the whole array.

As with all bucket sorts, performance depends critically on the balance of the buckets. In the ideal case of a balanced data set, each bucket will be approximately the same size. If the number $m$ of buckets is linear in the input size $n$, each bucket has a constant size, so sorting a single bucket with an $O(n^2)$ algorithm like insertion sort has complexity $O(1^2) = O(1)$. The running time of the final insertion sorts is therefore $m \cdot O(1) = O(m) = O(n)$.

Choosing a value for $m$, the number of buckets, trades off time spent classifying elements (high $m$) against time spent in the final insertion sort step (low $m$). For example, if $m$ is chosen proportional to $\sqrt{n}$, each bucket holds about $\sqrt{n}$ elements, so the running time of the final insertion sorts is $m \cdot O(\sqrt{n}^2) = O(n^{3/2})$.
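Both cases follow from one expression. As a worked sketch, assuming perfectly balanced buckets of $n/m$ elements each and a quadratic per-bucket sort, the cost of the final phase is

$$ m \cdot O\!\left(\left(\frac{n}{m}\right)^{2}\right) = O\!\left(\frac{n^{2}}{m}\right). $$

Substituting $m = \Theta(n)$ gives $O(n)$, and substituting $m = \Theta(\sqrt{n})$ gives $O(n^{3/2})$.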

In the worst-case scenarios where almost all the elements are in a few buckets, the complexity of the algorithm is limited by the performance of the final bucket-sorting method, so degrades to $O(n^2)$. Variations of the algorithm improve worst-case performance by using better-performing sorts such as quicksort or recursive flashsort on buckets which exceed a certain size limit.^[2]^[3]
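The idea of switching methods for oversized buckets can be illustrated as follows. This is a hedged sketch rather than any published variant: `finish_buckets`, `insertion_sort_range`, and the threshold `limit=16` are assumptions, and Python's built-in `sorted` merely stands in for quicksort or a recursive flashsort (it also copies the slice, so it is not in place). `K` holds the bucket boundaries as prefix sums, so bucket $b$ occupies the 0-based slice `a[K[b-1]:K[b]]`.

```python
def insertion_sort_range(a, lo, hi):
    """Sort the slice a[lo:hi] in place by insertion."""
    for p in range(lo + 1, hi):
        x = a[p]
        q = p
        while q > lo and a[q - 1] > x:     # shift larger elements right
            a[q] = a[q - 1]
            q -= 1
        a[q] = x


def finish_buckets(a, K, limit=16):
    """Sort each bucket of an already distributed array a."""
    for b in range(1, len(K)):
        lo, hi = K[b - 1], K[b]
        if hi - lo > limit:
            # Oversized bucket: fall back to an O(n log n) sort (stand-in).
            a[lo:hi] = sorted(a[lo:hi])
        else:
            insertion_sort_range(a, lo, hi)


a = [3, 1, 9, 7, 5]                        # already distributed into 2 buckets
finish_buckets(a, K=[0, 2, 5], limit=2)
print(a)  # [1, 3, 5, 7, 9]
```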

For $m = 0.1n$ with uniformly distributed random data, flashsort is faster than heapsort for all $n$ and faster than quicksort for $n > 80$. It becomes about twice as fast as quicksort at $n = 10000$.^[1] Note that these measurements were taken in the late 1990s, when memory hierarchies were much less dependent on caching.

Due to the in situ permutation that flashsort performs in its classification process, flashsort is not stable. If stability is required, it is possible to use a second array so elements can be classified sequentially. However, in this case, the algorithm will require $O(n)$ additional memory.
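A stable distribution into a second array might be sketched as follows, in the style of counting sort. The name `stable_distribute` and the `bucket_of` callback are assumptions for illustration; only the distribution step is shown, and a stable per-bucket sort would still have to follow.

```python
def stable_distribute(a, bucket_of, m):
    """Return a new list with a's elements grouped by bucket 1..m,
    keeping the input order within each bucket."""
    start = [0] * (m + 2)
    for x in a:                            # count, shifted by one bucket
        start[bucket_of(x) + 1] += 1
    for b in range(1, m + 2):              # start[b] = 0-based first slot of bucket b
        start[b] += start[b - 1]
    out = [None] * len(a)                  # the O(n) second array
    for x in a:                            # sequential pass preserves input order
        b = bucket_of(x)
        out[start[b]] = x
        start[b] += 1
    return out


pairs = [(5, "a"), (1, "b"), (5, "c"), (3, "d")]
print(stable_distribute(pairs, lambda p: 1 if p[0] < 5 else 2, 2))
# [(1, 'b'), (3, 'd'), (5, 'a'), (5, 'c')]
```

The two records with key 5 keep their input order, which the in-place cycles do not guarantee.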

References

[1] Neubert, Karl-Dietrich (February 1998). "The Flashsort1 Algorithm". Dr. Dobb's Journal. 23 (2): 123–125, 131. Retrieved 2007-11-06.
[2] Neubert, Karl-Dietrich (1998). "The FlashSort Algorithm". Retrieved 2007-11-06.
[3] Xiao, Li; Zhang, Xiaodong; Kubricht, Stefan A. (2000). "Improving Memory Performance of Sorting Algorithms: Cache-Effective Quicksort". ACM Journal of Experimental Algorithmics. 5. CiteSeerX 10.1.1.43.736. doi:10.1145/351827.384245. Archived from the original on 2007-11-02. Retrieved 2007-11-06.
