Integer sorting

In computer science, integer sorting is the algorithmic problem of sorting a collection of data values by integer keys. Algorithms designed for integer sorting may also often be applied to sorting problems in which the keys are floating point numbers, rational numbers, or text strings. [1] The ability to perform integer arithmetic on the keys allows integer sorting algorithms to be faster than comparison sorting algorithms in many cases, depending on the details of which operations are allowed in the model of computing and how large the integers to be sorted are.

Integer sorting algorithms including pigeonhole sort, counting sort, and radix sort are widely used and practical. Other integer sorting algorithms with smaller worst-case time bounds are not believed to be practical for computer architectures with 64 or fewer bits per word. Many such algorithms are known, with performance depending on a combination of the number of items to be sorted, number of bits per key, and number of bits per word of the computer performing the sorting algorithm.

General considerations

Models of computation

Time bounds for integer sorting algorithms typically depend on three parameters: the number n of data values to be sorted, the magnitude K of the largest possible key to be sorted, and the number w of bits that can be represented in a single machine word of the computer on which the algorithm is to be performed. Typically, it is assumed that w ≥ log_2(max(n, K)); that is, that machine words are large enough to represent an index into the sequence of input data, and also large enough to represent a single key. [2]

Integer sorting algorithms are usually designed to work in either the pointer machine or random access machine models of computing. The main difference between these two models is in how memory may be addressed. The random access machine allows any value that is stored in a register to be used as the address of memory read and write operations, with unit cost per operation. This ability allows certain complex operations on data to be implemented quickly using table lookups. In contrast, in the pointer machine model, read and write operations use addresses stored in pointers, and it is not allowed to perform arithmetic operations on these pointers. In both models, data values may be added, and bitwise Boolean operations and binary shift operations may typically also be performed on them, in unit time per operation. Different integer sorting algorithms make different assumptions, however, about whether integer multiplication is also allowed as a unit-time operation. [3] Other more specialized models of computation such as the parallel random access machine have also been considered. [4]

Andersson, Miltersen & Thorup (1999) showed that in some cases the multiplications or table lookups required by some integer sorting algorithms could be replaced by customized operations that would be more easily implemented in hardware but that are not typically available on general-purpose computers. Thorup (2003) improved on this by showing how to replace these special operations by the bit field manipulation instructions already available on Pentium processors.

In external memory models of computing, no known integer sorting algorithm is faster than comparison sorting. Researchers have shown that, in these models, restricted classes of algorithms that are limited in how they manipulate their keys cannot be faster than comparison sorting, [5] and that an integer sorting algorithm that is faster than comparison sorting would imply the falsity of a standard conjecture in network coding. [6]

Sorting versus integer priority queues

A priority queue is a data structure for maintaining a collection of items with numerical priorities, having operations for finding and removing the item with the minimum priority value. Comparison-based priority queues such as the binary heap take logarithmic time per update, but other structures such as the van Emde Boas tree or bucket queue may be faster for inputs whose priorities are small integers. These data structures can be used in the selection sort algorithm, which sorts a collection of elements by repeatedly finding and removing the smallest element from the collection, and returning the elements in the order they were found. A priority queue can be used to maintain the collection of elements in this algorithm, and the time for this algorithm on a collection of n elements can be bounded by the time to initialize the priority queue and then to perform n find and remove operations. For instance, using a binary heap as a priority queue in selection sort leads to the heap sort algorithm, a comparison sorting algorithm that takes O(n log n) time. Instead, using selection sort with a bucket queue gives a form of pigeonhole sort, and using van Emde Boas trees or other integer priority queues leads to other fast integer sorting algorithms. [7]
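The correspondence between priority queues and sorting described above can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, Python's standard heapq module stands in for the binary heap, and the bucket queue is a minimal version that assumes integer keys in the range 0 to K − 1.

```python
import heapq

def heap_selection_sort(keys):
    """Selection sort driven by a binary heap (in effect heapsort): O(n log n)."""
    heap = list(keys)
    heapq.heapify(heap)  # initialize the priority queue
    # Perform n find-and-remove-minimum operations.
    return [heapq.heappop(heap) for _ in range(len(heap))]

def bucket_queue_sort(keys, K):
    """Selection sort driven by a bucket queue; keys must be integers in range(K)."""
    buckets = [[] for _ in range(K)]
    for key in keys:
        buckets[key].append(key)  # insertion takes constant time
    out = []
    lowest = 0  # pointer to the most recently found non-empty bucket
    for _ in range(len(keys)):
        while not buckets[lowest]:
            lowest += 1  # the pointer only moves forward: O(K) in total
        out.append(buckets[lowest].pop())
    return out
```

With the bucket queue the total work is O(n + K), which is the pigeonhole sort bound, whereas the heap version performs n removals at logarithmic cost each.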

Instead of using an integer priority queue in a sorting algorithm, it is possible to go the other direction, and use integer sorting algorithms as subroutines within an integer priority queue data structure. Thorup (2007) used this idea to show that, if it is possible to perform integer sorting in time T(n) per key, then the same time bound applies to the time per insertion or deletion operation in a priority queue data structure. Thorup's reduction is complicated and assumes the availability of either fast multiplication operations or table lookups, but he also provides an alternative priority queue using only addition and Boolean operations with time T(n) + T(log n) + T(log log n) + ... per operation, at most multiplying the time by an iterated logarithm. [7]

Usability

The classical integer sorting algorithms of pigeonhole sort, counting sort, and radix sort are widely used and practical. [8] Much of the subsequent research on integer sorting algorithms has focused less on practicality and more on theoretical improvements in their worst case analysis, and the algorithms that come from this line of research are not believed to be practical for current 64-bit computer architectures, although experiments have shown that some of these methods may be an improvement on radix sorting for data with 128 or more bits per key. [9] Additionally, for large data sets, the near-random memory access patterns of many integer sorting algorithms can handicap them compared to comparison sorting algorithms that have been designed with the memory hierarchy in mind. [10]

Integer sorting provides one of the six benchmarks in the DARPA High Productivity Computing Systems Discrete Mathematics benchmark suite, [11] and one of eleven benchmarks in the NAS Parallel Benchmarks suite.

Practical algorithms

Pigeonhole sort and counting sort can both sort n data items having keys in the range from 0 to K − 1 in time O(n + K). In pigeonhole sort (often called bucket sort), pointers to the data items are distributed to a table of buckets, represented as collection data types such as linked lists, using the keys as indices into the table. Then, all of the buckets are concatenated together to form the output list. [12] Counting sort uses a table of counters in place of a table of buckets, to determine the number of items with each key. Then, a prefix sum computation is used to determine the range of positions in the sorted output at which the values with each key should be placed. Finally, in a second pass over the input, each item is moved to its key's position in the output array. [13] Both algorithms involve only simple loops over the input data (taking time O(n)) and over the set of possible keys (taking time O(K)), giving their O(n + K) overall time bound.
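Both procedures can be sketched in a few lines of Python. The function names and the key parameter are assumptions made for illustration; Python lists stand in for the linked-list buckets, and keys are assumed to be integers in the range 0 to K − 1.

```python
def pigeonhole_sort(items, K, key=lambda x: x):
    """Distribute items into K buckets by key, then concatenate the buckets."""
    buckets = [[] for _ in range(K)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]

def counting_sort(items, K, key=lambda x: x):
    """Count the keys, take a prefix sum, then place each item at its position."""
    counts = [0] * K
    for item in items:
        counts[key(item)] += 1
    # Exclusive prefix sum: counts[k] becomes the first output index for key k.
    start = 0
    for k in range(K):
        counts[k], start = start, start + counts[k]
    out = [None] * len(items)
    for item in items:  # second pass over the input
        k = key(item)
        out[counts[k]] = item
        counts[k] += 1
    return out
```

Both sketches keep items with equal keys in their input order, so either is stable and could serve as the per-digit subroutine of a radix sort.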

Radix sort is a sorting algorithm that works for larger keys than pigeonhole sort or counting sort by performing multiple passes over the data. Each pass sorts the input using only part of the keys, by using a different sorting algorithm (such as pigeonhole sort or counting sort) that is suited only for small keys. To break the keys into parts, the radix sort algorithm computes the positional notation for each key, according to some chosen radix; then, the part of the key used for the ith pass of the algorithm is the ith digit in the positional notation for the full key, starting from the least significant digit and progressing to the most significant. For this algorithm to work correctly, the sorting algorithm used in each pass over the data must be stable: items with equal digits should not change positions with each other. For greatest efficiency, the radix should be chosen to be near the number of data items, n. Additionally, using a power of two near n as the radix allows the keys for each pass to be computed quickly using only fast binary shift and mask operations. With these choices, and with pigeonhole sort or counting sort as the base algorithm, the radix sorting algorithm can sort n data items having keys in the range from 0 to K − 1 in time O(n log_n K). [14]
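A least-significant-digit radix sort along these lines might look as follows in Python. This is a sketch under stated assumptions: the function name is hypothetical, keys are non-negative integers below K, the radix is taken to be the largest power of two not exceeding n, and a stable pigeonhole pass is used for each digit.

```python
def radix_sort(keys, K):
    """LSD radix sort of non-negative integer keys smaller than K."""
    n = len(keys)
    if n < 2:
        return list(keys)
    bits = max(1, n.bit_length() - 1)  # radix 2**bits is a power of two <= n
    radix = 1 << bits
    mask = radix - 1
    out = list(keys)
    shift = 0
    # Continue until every digit of the largest possible key has been used.
    while (K - 1) >> shift > 0:
        buckets = [[] for _ in range(radix)]
        for key in out:
            # Extract the current digit with a shift and a mask.
            buckets[(key >> shift) & mask].append(key)
        # Concatenating the buckets in order is a stable pass on this digit.
        out = [key for bucket in buckets for key in bucket]
        shift += bits
    return out
```

Each pass costs O(n + radix) = O(n), and roughly log K / log n passes are needed, which matches the bound stated above.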

Theoretical algorithms

Many integer sorting algorithms have been developed whose theoretical analysis shows them to behave better than comparison sorting, pigeonhole sorting, or radix sorting for large enough combinations of the parameters defining the number of items to be sorted, range of keys, and machine word size. Which algorithm has the best performance depends on the values of these parameters. However, despite their theoretical advantages, these algorithms are not an improvement for the typical ranges of these parameters that arise in practical sorting problems. [9]

Algorithms for small keys

A van Emde Boas tree may be used as a priority queue to sort a set of n keys, each in the range from 0 to K − 1, in time O(n log log K). This is a theoretical improvement over radix sorting when K is sufficiently large. However, in order to use a van Emde Boas tree, one either needs a directly addressable memory of K words, or one needs to simulate it using a hash table, reducing the space to linear but making the algorithm randomized. Another priority queue with similar performance (including the need for randomization in the form of hash tables) is the Y-fast trie of Willard (1983).

A more sophisticated technique with a similar flavor and with better theoretical performance was developed by Kirkpatrick & Reisch (1984). They observed that each pass of radix sort can be interpreted as a range reduction technique that, in linear time, reduces the maximum key size by a factor of n; instead, their technique reduces the key size to the square root of its previous value (halving the number of bits needed to represent a key), again in linear time. As in radix sort, they interpret the keys as two-digit base-b numbers for a base b that is approximately √K. They then group the items to be sorted into buckets according to their high digits, in linear time, using either a large but uninitialized direct addressed memory or a hash table. Each bucket has a representative, the item in the bucket with the largest key; they then sort the list of items using as keys the high digits for the representatives and the low digits for the non-representatives. By grouping the items from this list into buckets again, each bucket may be placed into sorted order, and by extracting the representatives from the sorted list the buckets may be concatenated together into sorted order. Thus, in linear time, the sorting problem is reduced to another recursive sorting problem in which the keys are much smaller, the square root of their previous magnitude. Repeating this range reduction until the keys are small enough to bucket sort leads to an algorithm with running time O(n log log_n K).
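The running time can be checked with a short calculation. The following is a sketch, assuming that K > n, that each reduction step takes O(n) time on a list whose length stays O(n), and that the recursion stops once keys fit in log_2 n bits so that bucket sort finishes in O(n) time; the symbols b_i (bits per key after i steps) and t (number of steps) are introduced here for illustration.

```latex
\begin{align*}
b_0 &= \log_2 K, \qquad b_{i+1} = \tfrac{1}{2}\, b_i
      \;\Longrightarrow\; b_t = \frac{\log_2 K}{2^{t}},\\
b_t \le \log_2 n
    &\iff 2^{t} \ge \frac{\log_2 K}{\log_2 n} = \log_n K
     \iff t \ge \log_2 \log_n K,\\
T(n, K) &= O(n)\cdot O(\log \log_n K) + O(n) = O(n \log \log_n K).
\end{align*}
```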

A complicated randomized algorithm of Han & Thorup (2002) in the word RAM model of computation allows these time bounds to be reduced even further, to O(n √(log log K)).

Algorithms for large words

An integer sorting algorithm is said to be non-conservative if it requires a word size w that is significantly larger than log max(n, K). [15] As an extreme instance, if w ≥ K, and all keys are distinct, then the set of keys may be sorted in linear time by representing it as a bitvector, with a 1 bit in position i when i is one of the input keys, and then repeatedly removing the least significant bit. [16]
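The bitvector idea can be illustrated in Python, whose arbitrary-precision integers stand in here for a single machine word of at least K bits. The function name is hypothetical, and int.bit_length plays the role of a hardware instruction for locating a set bit; on a real word RAM the linear time bound depends on such operations taking constant time.

```python
def bitvector_sort(keys):
    """Sort distinct non-negative integer keys using one wide 'word' as a bitvector."""
    bits = 0
    for key in keys:
        bits |= 1 << key  # set bit i when i is one of the input keys
    out = []
    while bits:
        low = bits & -bits  # isolate the least significant set bit
        out.append(low.bit_length() - 1)  # position of that bit is the key
        bits ^= low  # remove it from the bitvector
    return out
```

Because a set bit records only the presence of a key, duplicate keys would be collapsed, which is why the keys are required to be distinct.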

The non-conservative packed sorting algorithm of Albers & Hagerup (1997) uses a subroutine, based on Ken Batcher's bitonic sorting network, for merging two sorted sequences of keys that are each short enough to be packed into a single machine word. The input to the packed sorting algorithm, a sequence of items stored one per word, is transformed into a packed form, a sequence of words each holding multiple items in sorted order, by using this subroutine repeatedly to double the number of items packed into each word. Once the sequence is in packed form, Albers and Hagerup use a form of merge sort to sort it; when two sequences are being merged to form a single longer sequence, the same bitonic sorting subroutine can be used to repeatedly extract packed words consisting of the smallest remaining elements of the two sequences. This algorithm gains enough of a speedup from its packed representation to sort its input in linear time whenever it is possible for a single word to contain Ω(log n log log n) keys; that is, when log K log n log log n ≤ cw for some constant c > 0.

Algorithms for few items

Pigeonhole sort, counting sort, radix sort, and Van Emde Boas tree sorting all work best when the key size is small; for large enough keys, they become slower than comparison sorting algorithms. However, when the key size or the word size is very large relative to the number of items (or equivalently when the number of items is small), it may again become possible to sort quickly, using different algorithms that take advantage of the parallelism inherent in the ability to perform arithmetic operations on large words.

An early result in this direction was provided by Ajtai, Fredman & Komlós (1984) using the cell-probe model of computation (an artificial model in which the complexity of an algorithm is measured only by the number of memory accesses it performs). Building on their work, Fredman & Willard (1994) described two data structures, the Q-heap and the atomic heap, that are implementable on a random access machine. The Q-heap is a bit-parallel version of a binary trie, and allows both priority queue operations and successor and predecessor queries to be performed in constant time for sets of O((log N)^(1/4)) items, where N ≤ 2^w is the size of the precomputed tables needed to implement the data structure. The atomic heap is a B-tree in which each tree node is represented as a Q-heap; it allows constant time priority queue operations (and therefore sorting) for sets of (log N)^O(1) items.

Andersson et al. (1998) provide a randomized algorithm called signature sort that allows for linear time sorting of sets of up to 2^O((log w)^(1/2 − ε)) items at a time, for any constant ε > 0. As in the algorithm of Kirkpatrick and Reisch, they perform range reduction using a representation of the keys as numbers in base b for a careful choice of b. Their range reduction algorithm replaces each digit by a signature, which is a hashed value with O(log n) bits such that different digit values have different signatures. If n is sufficiently small, the numbers formed by this replacement process will be significantly smaller than the original keys, allowing the non-conservative packed sorting algorithm of Albers & Hagerup (1997) to sort the replaced numbers in linear time. From the sorted list of replaced numbers, it is possible to form a compressed trie of the keys in linear time, and the children of each node in the trie may be sorted recursively using only keys of size b, after which a tree traversal produces the sorted order of the items.

Trans-dichotomous algorithms

Fredman & Willard (1993) introduced the transdichotomous model of analysis for integer sorting algorithms, in which nothing is assumed about the range of the integer keys and one must bound the algorithm's performance by a function of the number of data values alone. Alternatively, in this model, the running time for an algorithm on a set of n items is assumed to be the worst case running time for any possible combination of values of K and w. The first algorithm of this type was Fredman and Willard's fusion tree sorting algorithm, which runs in time O(n log n / log log n); this is an improvement over comparison sorting for any choice of K and w. An alternative version of their algorithm that includes the use of random numbers and integer division operations improves this to O(n √(log n)).

Since their work, even better algorithms have been developed. For instance, by repeatedly applying the Kirkpatrick–Reisch range reduction technique until the keys are small enough to apply the Albers–Hagerup packed sorting algorithm, it is possible to sort in time O(n log log n); however, the range reduction part of this algorithm requires either a large memory (proportional to K) or randomization in the form of hash tables. [17]

Han & Thorup (2002) showed how to sort in randomized time O(n √(log log n)). Their technique involves using ideas related to signature sorting to partition the data into many small sublists, of a size small enough that signature sorting can sort each of them efficiently. It is also possible to use similar ideas to sort integers deterministically in time O(n log log n) and linear space. [18] Using only simple arithmetic operations (no multiplications or table lookups) it is possible to sort in randomized expected time O(n log log n) [19] or deterministically in time O(n (log log n)^(1 + ε)) for any constant ε > 0. [1]

References

Footnotes
  1. Han & Thorup (2002).
  2. Fredman & Willard (1993).
  3. The question of whether integer multiplication or table lookup operations should be permitted goes back to Fredman & Willard (1993); see also Andersson, Miltersen & Thorup (1999).
  4. Reif (1985); comment in Cole & Vishkin (1986); Hagerup (1987); Bhatt et al. (1991); Albers & Hagerup (1997).
  5. Aggarwal & Vitter (1988).
  6. Farhadi et al. (2020).
  7. Chowdhury (2008).
  8. McIlroy, Bostic & McIlroy (1993); Andersson & Nilsson (1998).
  9. Rahman & Raman (1998).
  10. Pedersen (1999).
  11. DARPA HPCS Discrete Mathematics Benchmarks, archived 2016-03-10 at the Wayback Machine, Duncan A. Buell, University of South Carolina, retrieved 2011-04-20.
  12. Goodrich & Tamassia (2002). Although Cormen et al. (2001) also describe a version of this sorting algorithm, the version they describe is adapted to inputs where the keys are real numbers with a known distribution, rather than to integer sorting.
  13. Cormen et al. (2001), 8.2 Counting Sort, pp. 168–169.
  14. Comrie (1929–1930); Cormen et al. (2001), 8.3 Radix Sort, pp. 170–173.
  15. Kirkpatrick & Reisch (1984); Albers & Hagerup (1997).
  16. Kirkpatrick & Reisch (1984).
  17. Andersson et al. (1998).
  18. Han (2004).
  19. Thorup (2002).
Secondary sources
Primary sources