Funnelsort

Funnelsort is a comparison-based sorting algorithm. It is similar to merge sort, but it is a cache-oblivious algorithm, designed for a setting where the number of elements to sort is too large to fit in the cache where operations are performed. It was introduced by Matteo Frigo, Charles Leiserson, Harald Prokop, and Sridhar Ramachandran in 1999 in the context of the cache-oblivious model.[1][2]

Mathematical properties

In the external memory model, the number of memory transfers it needs to perform a sort of $n$ items on a machine with cache of size $Z$ and cache lines of length $L$ is $O\!\left(\frac{n}{L}\log_Z n\right)$, under the tall cache assumption that $Z = \Omega(L^2)$. This number of memory transfers has been shown to be asymptotically optimal for comparison sorts. Funnelsort also achieves the asymptotically optimal runtime complexity of $\Theta(n \log n)$.

Algorithm

Basic overview

Funnelsort operates on a contiguous array of $n$ elements. To sort the elements, it performs the following:

  1. Split the input into $n^{1/3}$ arrays of size $n^{2/3}$, and sort the arrays recursively.
  2. Merge the $n^{1/3}$ sorted sequences using an $n^{1/3}$-merger. (This process will be described in more detail.)

Funnelsort is similar to merge sort in that some number of subarrays are recursively sorted, after which a merging step combines the subarrays into one sorted array. Merging is performed by a device called a k-merger, which is described in the section below.
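The top-level recursion can be illustrated with a short Python sketch (a minimal illustration, not a reference implementation: the heap-based heapq.merge used here merely stands in for the k-merger described in the next section, so the sketch reproduces the recursion pattern but not the cache-oblivious merging; the function name is invented for this example).

```python
# Sketch of funnelsort's top-level recursion: split into ~n^(1/3) runs of
# size ~n^(2/3), sort them recursively, then merge the sorted runs.
import heapq


def funnelsort_sketch(a):
    n = len(a)
    if n <= 1:
        return list(a)
    k = max(2, round(n ** (1 / 3)))      # number of runs, about n^(1/3)
    size = -(-n // k)                    # ceil(n / k): runs of size ~n^(2/3)
    runs = [funnelsort_sketch(a[i:i + size]) for i in range(0, n, size)]
    return list(heapq.merge(*runs))      # stand-in for the k-merger


print(funnelsort_sketch([5, 3, 8, 1, 9, 2, 7]))   # [1, 2, 3, 5, 7, 8, 9]
```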

k-mergers

A k-merger takes $k$ sorted sequences. Upon one invocation of a k-merger, it outputs the first $k^3$ elements of the sorted sequence obtained by merging the $k$ input sequences.

At the top level, funnelsort uses an $n^{1/3}$-merger on $n^{1/3}$ sequences of length $n^{2/3}$, and invokes this merger once.

The k-merger is built recursively out of $\sqrt{k}$-mergers. It consists of $\sqrt{k}$ input $\sqrt{k}$-mergers $R_1, R_2, \ldots, R_{\sqrt{k}}$, and a single output $\sqrt{k}$-merger $C$. The $k$ inputs are separated into $\sqrt{k}$ sets of $\sqrt{k}$ inputs each. Each of these sets is an input to one of the input mergers. The output of each input merger is connected to a buffer, a FIFO queue that can hold $2k^{3/2}$ elements. The buffers are implemented as circular queues. The outputs of the $\sqrt{k}$ buffers are connected to the inputs of the output merger $C$. Finally, the output of $C$ is the output of the entire k-merger.

In this construction, an input merger only outputs $k^{3/2}$ items at once, but the buffer it outputs to has double that capacity. This is done so that an input merger needs to be called only when its buffer does not have enough items, but that when it is called, it outputs a large batch of items at once (namely $k^{3/2}$ of them).

A k-merger works recursively in the following way. To output $k^3$ elements, it recursively invokes its output merger $C$ $k^{3/2}$ times. However, before it makes a call to $C$, it checks all of its buffers, filling each one that is less than half full. To fill the $i$-th buffer, it recursively invokes the corresponding input merger $R_i$ once. If this cannot be done (because the merger has run out of inputs), this step is skipped. Since such a call outputs $k^{3/2}$ elements, the buffer then contains at least $k^{3/2}$ elements. At the end of all these operations, the k-merger has output the first $k^3$ of its input elements, in sorted order.
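The structure and operation just described can be made concrete with a Python sketch (an illustration under simplifying assumptions, not the authors' implementation: the output $\sqrt{k}$-merger $C$ is collapsed into a flat selection over the buffer heads, integer rounding is handled loosely, an extra refill-on-empty guard keeps the simplified version correct, and the class and method names are invented for this example).

```python
# Sketch of a k-merger: the k sorted runs are split into ~sqrt(k) groups,
# each handled by a recursive sub-merger (R_i) feeding a FIFO buffer; one
# invocation emits up to k**3 merged elements.
import math
from collections import deque


class KMerger:
    def __init__(self, runs):
        k = max(2, len(runs))
        self.batch = k ** 3                        # output of one invocation
        g = math.isqrt(k - 1) + 1                  # ceil(sqrt(k)) groups
        self.half = g ** 3                         # "half full" ~ k^(3/2)
        if len(runs) <= 2:                         # base case: runs feed directly
            self.buffers = [deque(r) for r in runs]
            self.feeders = [None] * len(runs)
        else:                                      # recursive case
            groups = [runs[i::g] for i in range(g)]
            self.feeders = [KMerger(grp) for grp in groups]   # input mergers R_i
            self.buffers = [deque() for _ in groups]          # FIFO buffers

    def done(self):
        return all(not b for b in self.buffers) and \
               all(f is None or f.done() for f in self.feeders)

    def _refill(self, i):
        f = self.feeders[i]
        if f is not None and not f.done():
            self.buffers[i].extend(f.invoke())     # one call to R_i

    def invoke(self):
        out = []
        while len(out) < self.batch and not self.done():
            # Before each round of the output merger, top up every buffer
            # that is less than half full (skipping exhausted feeders).
            for i, b in enumerate(self.buffers):
                if len(b) < self.half:
                    self._refill(i)
            # One round stands in for a call to the output merger C: it emits
            # up to ~k^(3/2) elements by repeatedly taking the smallest head.
            for _ in range(self.half):
                for i, b in enumerate(self.buffers):
                    if not b:                      # guard for this sketch only
                        self._refill(i)
                live = [i for i, b in enumerate(self.buffers) if b]
                if not live or len(out) >= self.batch:
                    break
                j = min(live, key=lambda i: self.buffers[i][0])
                out.append(self.buffers[j].popleft())
        return out


runs = [[1, 5, 9], [2, 6, 10], [3, 7, 11], [0, 4, 8]]
print(KMerger(runs).invoke())                      # 0 .. 11 in sorted order
```

In the real construction each buffer holds $2k^{3/2}$ elements and the output side is itself a recursive $\sqrt{k}$-merger rather than the flat selection used here; the sketch keeps only the recursive input mergers, the buffers, and the refill policy.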

Analysis

Most of the analysis of this algorithm revolves around the space and cache-miss complexity of the k-merger.

The first important bound is that a k-merger can be fit in $O(k^2)$ space. To see this, we let $S(k)$ denote the space needed for a k-merger. To fit the $\sqrt{k}$ buffers of size $2k^{3/2}$ takes $O(k^2)$ space. To fit the $\sqrt{k}+1$ smaller mergers takes $(\sqrt{k}+1)\,S(\sqrt{k})$ space. Thus, the space satisfies the recurrence $S(k) = (\sqrt{k}+1)\,S(\sqrt{k}) + O(k^2)$. This recurrence has solution $S(k) = O(k^2)$.
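The claimed solution can be checked by substitution; a sketch of the inductive step, writing $c$ and $c'$ for the constants hidden in the $O$-terms (these constants are introduced only for this illustration):

$$
S(k) \;=\; \bigl(\sqrt{k}+1\bigr)\,S\!\bigl(\sqrt{k}\bigr) + O(k^{2})
\;\le\; \bigl(\sqrt{k}+1\bigr)\,c\,k + c'k^{2}
\;=\; c\,k^{3/2} + c\,k + c'k^{2}
\;=\; O(k^{2}),
$$

using the inductive hypothesis $S(\sqrt{k}) \le c\,(\sqrt{k})^{2} = c\,k$; the lower-order terms $c\,k^{3/2} + c\,k$ are absorbed into the quadratic term once $k$ exceeds a constant.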

It follows that there is a positive constant $\alpha$ such that a problem of size at most $\alpha\sqrt{Z}$ fits entirely in cache, meaning that it incurs no additional cache misses.

Letting $Q_M(k)$ denote the number of cache misses incurred by a call to a k-merger, one can show that $Q_M(k) = O\!\left(k + \frac{k^3}{L} + \frac{k^3 \log_Z k}{L}\right)$. This is shown by an induction argument, which has $k \le \alpha\sqrt{Z}$ as its base case. For larger $k$, we can bound the number of times a $\sqrt{k}$-merger is called. The output merger $C$ is called exactly $k^{3/2}$ times. The total number of calls on input mergers is at most $k^{3/2} + 2\sqrt{k}$. This gives a total bound of $2k^{3/2} + 2\sqrt{k}$ recursive calls. In addition, the algorithm checks every buffer to see if it needs to be filled. This is done on $\sqrt{k}$ buffers at each of the $k^{3/2}$ steps, leading to at most $k^2$ cache misses for all the checks.

This leads to the recurrence $Q_M(k) = \left(2k^{3/2} + 2\sqrt{k}\right) Q_M\!\left(\sqrt{k}\right) + k^2$, which can be shown to have the solution given above.
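One way to see that the stated bound is consistent with this recurrence is to check the dominant term: since $\log_Z \sqrt{k} = \tfrac{1}{2}\log_Z k$, the roughly $2k^{3/2}$ recursive calls together contribute

$$
2k^{3/2} \cdot \frac{(\sqrt{k})^{3}\,\log_Z \sqrt{k}}{L}
\;=\;
2k^{3/2} \cdot \frac{k^{3/2}\cdot\tfrac{1}{2}\log_Z k}{L}
\;=\;
\frac{k^{3}\log_Z k}{L},
$$

so the leading term of the solution reproduces itself, while the remaining terms absorb the $k^2$ cost of the buffer checks. (This is only the key step of the verification, not the full induction.)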

Finally, the total cache misses $Q(n)$ for the entire sort can be analyzed. It satisfies the recurrence $Q(n) = n^{1/3}\,Q\!\left(n^{2/3}\right) + Q_M\!\left(n^{1/3}\right)$. This can be shown to have solution $Q(n) = O\!\left(\frac{n}{L}\log_Z n\right)$.
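Substituting $k = n^{1/3}$ into the k-merger bound makes the merge cost at the top level of the recursion explicit (a routine calculation, shown here for concreteness):

$$
Q_M\!\left(n^{1/3}\right)
\;=\;
O\!\left(n^{1/3} + \frac{n}{L} + \frac{n \log_Z n^{1/3}}{L}\right)
\;=\;
O\!\left(n^{1/3} + \frac{n}{L}\left(1 + \log_Z n\right)\right),
$$

and solving the recurrence for $Q(n)$ with this merge cost yields the stated bound.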

Lazy funnelsort

Lazy funnelsort is a modification of funnelsort, introduced by Gerth Stølting Brodal and Rolf Fagerberg in 2002.[3] The modification is that when a merger is invoked, it does not have to fill each of its buffers. Instead, it lazily fills a buffer only when it is empty. This modification has the same asymptotic runtime and number of memory transfers as the original funnelsort, but it has applications in cache-oblivious algorithms for problems in computational geometry, in a method known as distribution sweeping.
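In terms of the earlier k-merger sketch, the lazy modification amounts to dropping the eager half-full pass and refilling a buffer only at the moment it runs empty; a minimal illustration, written as a drop-in replacement for the invoke method of the hypothetical KMerger class shown above:

```python
    def invoke(self):
        """Lazy variant: a buffer is refilled only once it has run empty."""
        out = []
        while len(out) < self.batch:
            for i, b in enumerate(self.buffers):
                if not b:                      # fill only when empty
                    self._refill(i)
            live = [i for i, b in enumerate(self.buffers) if b]
            if not live:                       # every input is exhausted
                break
            j = min(live, key=lambda i: self.buffers[i][0])
            out.append(self.buffers[j].popleft())
        return out
```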

See also

Cache-oblivious distribution sort – a similar comparison-based, cache-oblivious sorting algorithm based on distribution (in the style of quicksort) rather than merging, with the same asymptotic memory-transfer and runtime bounds.


References

  1. Frigo, M.; Leiserson, C. E.; Prokop, H.; Ramachandran, S. (1999). "Cache-Oblivious Algorithms". Proceedings of the 40th IEEE Symposium on Foundations of Computer Science (FOCS '99). pp. 285–297.
  2. Prokop, Harald (1999). Cache-Oblivious Algorithms. Master's thesis, MIT.
  3. Brodal, Gerth Stølting; Fagerberg, Rolf (25 June 2002). "Cache Oblivious Distribution Sweeping". Automata, Languages and Programming. Lecture Notes in Computer Science. Vol. 2380. Springer. pp. 426–438. CiteSeerX 10.1.1.117.6837. doi:10.1007/3-540-45465-9_37. ISBN 978-3-540-43864-9. See also the longer technical report.