External sorting

External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and must instead reside in slower external memory, usually a disk drive. External sorting algorithms are therefore external memory algorithms, applicable in the external memory model of computation.

External sorting algorithms generally fall into two types, distribution sorting, which resembles quicksort, and external merge sort, which resembles merge sort. External merge sort typically uses a hybrid sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file.

Model

External sorting algorithms can be analyzed in the external memory model. In this model, a cache or internal memory of size M and an unbounded external memory are divided into blocks of size B, and the running time of an algorithm is determined by the number of memory transfers between internal and external memory. Like their cache-oblivious counterparts, asymptotically optimal external sorting algorithms achieve a running time (in Big O notation) of O((N/B) log_{M/B}(N/B)), where N is the number of items being sorted.
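
For concreteness, the short Python calculation below evaluates this bound for one assumed set of parameters; the particular values of N, M, and B are illustrative assumptions, not measurements of any real system.

```python
import math

# Evaluate the external-memory sorting bound O((N/B) * log_{M/B}(N/B)) for one
# assumed set of parameters; the values below are illustrative only.
N = 10**9   # number of items to sort (assumed)
M = 10**6   # items that fit in internal memory (assumed)
B = 10**3   # items moved per block transfer (assumed)

block_transfers = (N / B) * math.log(N / B, M / B)
print(f"on the order of {block_transfers:,.0f} block transfers, up to constant factors")
```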

External merge sort

One example of external sorting is the external merge sort algorithm, which uses a K-way merge algorithm. It sorts chunks that each fit in RAM, then merges the sorted chunks together.[1][2]

The algorithm first sorts M items at a time and writes the sorted runs back to external memory. It then performs an M/B-way merge on those sorted runs, recursing if there is not enough main memory to merge efficiently in one pass. During a merge pass, B elements from each sorted run are held in internal memory, and the minimum is repeatedly moved to the output.
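
A minimal Python sketch of this sort-then-merge strategy is shown below. The one-record-per-line format, the max_items_in_memory threshold, and the single merge pass are simplifying assumptions; a production implementation would work on blocks of records and recurse into multiple merge passes when there are too many runs to merge at once.

```python
import heapq
import itertools
import os
import tempfile

def external_merge_sort(input_path, output_path, max_items_in_memory=100_000):
    """Sketch: sort a text file with one record per line using limited memory."""
    run_paths = []

    # Sort phase: read a chunk that fits in memory, sort it, write a sorted run.
    with open(input_path) as infile:
        while True:
            chunk = list(itertools.islice(infile, max_items_in_memory))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)

    # Merge phase: a single k-way merge of all runs. heapq.merge is lazy, so
    # only a small amount of each run is held in memory at any time.
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```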

For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:

  1. Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
  2. Write the sorted data to disk.
  3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
  4. Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
  5. Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available (this refill-and-merge loop is sketched below).
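
The refill-and-merge loop from steps 4 and 5 might look roughly like the following Python sketch. The newline-terminated record format and the 10 MB read and write buffer sizes mirror the example above and are assumptions, not requirements.

```python
import heapq

def merge_runs(run_files, out_file, read_size=10 * 2**20, write_size=10 * 2**20):
    """Sketch of steps 4-5: a k-way merge with fixed-size input buffers and an
    output buffer. Each run file is assumed to hold newline-terminated records
    in sorted order."""
    buffers = [[] for _ in run_files]   # pending records for each run
    heap = []                           # (record, run index); smallest on top

    def refill(i):
        # Read roughly read_size bytes of records from run i.
        buffers[i] = run_files[i].readlines(read_size)
        if buffers[i]:
            heapq.heappush(heap, (buffers[i].pop(0), i))

    for i in range(len(run_files)):
        refill(i)

    out_buffer, out_bytes = [], 0
    while heap:
        record, i = heapq.heappop(heap)
        out_buffer.append(record)
        out_bytes += len(record)
        if out_bytes >= write_size:     # output buffer full: flush to disk
            out_file.writelines(out_buffer)
            out_buffer, out_bytes = [], 0
        if buffers[i]:                  # next record of run i already in memory
            heapq.heappush(heap, (buffers[i].pop(0), i))
        else:                           # input buffer for run i is empty: refill
            refill(i)
    out_file.writelines(out_buffer)     # flush whatever is left
```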

The merge pass is key to making external merge sort work externally. The merge algorithm only makes one pass through each chunk, so chunks do not have to be loaded all at once; rather, sequential parts of the chunk are loaded as needed. And as long as the blocks read are relatively large (like the 10 MB in this example), the reads can be relatively efficient even on media with low random-read performance, like hard drives.

Historically, instead of a sort, a replacement selection algorithm[3] was sometimes used to perform the initial distribution, producing on average half as many output chunks of double the length.
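
A sketch of replacement selection in Python is given below; the in-memory list of runs and the memory_size parameter (a count of records rather than bytes) are simplifications made purely for illustration.

```python
import heapq
import itertools

def replacement_selection_runs(records, memory_size):
    """Sketch of replacement selection: produce sorted runs that, on randomly
    ordered input, average about twice the size of the available memory."""
    it = iter(records)
    # Heap entries are (run number, record); tuples order by run number first.
    heap = [(0, r) for r in itertools.islice(it, memory_size)]
    heapq.heapify(heap)
    runs, current_run, out = [], 0, []
    while heap:
        run_no, record = heapq.heappop(heap)
        if run_no != current_run:        # smallest record belongs to a later run
            runs.append(out)
            out, current_run = [], run_no
        out.append(record)
        nxt = next(it, None)
        if nxt is not None:
            # A new record can join the current run only if it is not smaller
            # than the record just written; otherwise it waits for the next run.
            run_for_nxt = current_run if nxt >= record else current_run + 1
            heapq.heappush(heap, (run_for_nxt, nxt))
    if out:
        runs.append(out)
    return runs
```

For instance, on the input [5, 1, 8, 2, 9, 3] with memory for two records, this sketch produces two runs ([1, 5, 8, 9] and [2, 3]) rather than the three runs that sorting two records at a time would give.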

Additional passes

The previous example is a two-pass sort: first sort, then merge. The sort ends with a single k-way merge, rather than a series of two-way merge passes as in a typical in-memory merge sort. This is because each merge pass reads and writes every value from and to disk, so reducing the number of passes more than compensates for the additional cost of a k-way merge.

The limitation to single-pass merging is that as the number of chunks increases, memory will be divided into more buffers, so each buffer is smaller. Eventually, the reads become so small that more time is spent on disk seeks than data transfer. A typical magnetic hard disk drive might have a 10 ms access time and 100 MB/s data transfer rate, so each seek takes as much time as transferring 1 MB of data.

Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single 500-way merge pass isn't efficient: we can only read 100 MB / 501 ≈ 200 KB from each chunk at once, so 5/6 of the disk's time is spent seeking. Using two merge passes solves the problem. Then the sorting process might look like this:

  1. Run the initial chunk-sorting pass as before to create 500×100 MB sorted chunks.
  2. Run a first merge pass combining 25×100 MB chunks at a time, resulting in 20×2.5 GB sorted chunks.
  3. Run a second merge pass to merge the 20×2.5 GB sorted chunks into a single 50 GB sorted result.

Although this requires an additional pass over the data, each read is now 4 MB long, so only 1/5 of the disk's time is spent seeking. The improvement in data transfer efficiency during the merge passes (from 16.6% to 80%, almost a 5× improvement) more than makes up for the doubled number of merge passes.
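
The efficiency figures in this example can be reproduced with a few lines of Python, using the same assumed disk characteristics (10 ms per seek, 100 MB/s transfer rate):

```python
# Reproduce the efficiency comparison above, assuming 10 ms per seek and a
# 100 MB/s transfer rate (the disk figures used throughout this example).
SEEK_S = 0.010
MB_PER_S = 100.0

def read_efficiency(read_mb):
    """Fraction of each read spent transferring data rather than seeking."""
    transfer_s = read_mb / MB_PER_S
    return transfer_s / (SEEK_S + transfer_s)

single_pass = read_efficiency(100 / 501)  # one 500-way merge: ~200 KB reads
two_pass = read_efficiency(4.0)           # two merge passes: ~4 MB reads
print(f"single merge pass: {single_pass:.1%} efficient reads")
print(f"two merge passes:  {two_pass:.1%} efficient reads "
      f"({two_pass / single_pass:.1f}x better per pass)")
```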

Variations include using an intermediate medium like solid-state disk for some stages; the fast temporary storage needn't be big enough to hold the whole dataset, just substantially larger than available main memory. Repeating the example above with 1 GB of temporary SSD storage, the first pass could merge 10×100 MB sorted chunks read from that temporary space to write 50×1 GB sorted chunks to HDD. The high bandwidth and random-read throughput of SSDs help speed the first pass, and the HDD reads for the second pass can then be 2 MB, large enough that seeks will not take up most of the read time. SSDs can also be used as read buffers in a merge phase, allowing fewer, larger reads (20 MB reads in this example) from HDD storage. Given the lower cost of SSD capacity relative to RAM, SSDs can be an economical tool for sorting large inputs with very limited memory.

Like in-memory sorts, efficient external sorts require O(n log n) time: exponentially growing datasets require linearly increasing numbers of passes that each take O(n) time.[4] Under reasonable assumptions at least 500 GB of data stored on a hard drive can be sorted using 1 GB of main memory before a third pass becomes advantageous, and many times that much data can be sorted before a fourth pass becomes useful.[5]

Main memory size is important. Doubling memory dedicated to sorting halves the number of chunks and the number of reads per chunk, reducing the number of seeks required by about three-quarters. The ratio of RAM to disk storage on servers often makes it convenient to do huge sorts on a cluster of machines[6] rather than on one machine with multiple passes. Media with high random-read performance like solid-state drives (SSDs) also increase the amount that can be sorted before additional passes improve performance.

External distribution sort

External distribution sort is analogous to quicksort. The algorithm finds approximately √(M/B) pivots and uses them to divide the N elements into approximately equally sized subarrays, each of whose elements are all smaller than the next, and then recurses until the size of each subarray is less than the block size. When the subarrays are smaller than the block size, sorting can be done quickly because all reads and writes are done in the cache, and in the external memory model it requires O(N/B) operations.

However, finding exactly √(M/B) pivots would not be fast enough to make the external distribution sort asymptotically optimal. Instead, slightly fewer pivots are found. To find these pivots, the algorithm splits the N input elements into chunks, takes a regular sample of elements from each chunk, and recursively uses the median of medians algorithm to find the pivots.[7]
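
The partitioning idea can be illustrated with the toy, in-memory Python sketch below. Random sampling stands in for the median-of-medians pivot-selection step, and the duplicate-value guard is a simplification; a real external version would stream each bucket to and from disk rather than hold everything in lists.

```python
import random

def distribution_sort(items, block_size, num_pivots):
    """Toy, fully in-memory sketch of the distribution strategy: choose pivots,
    partition into buckets, and recurse until a bucket fits in one block."""
    if len(items) <= block_size:
        return sorted(items)              # small enough to sort "in cache"
    # Random sampling stands in for the median-of-medians pivot selection.
    pivots = sorted(random.sample(items, min(num_pivots, len(items) - 1)))
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in items:
        lo, hi = 0, len(pivots)           # binary search for x's bucket
        while lo < hi:
            mid = (lo + hi) // 2
            if x < pivots[mid]:
                hi = mid
            else:
                lo = mid + 1
        buckets[lo].append(x)
    if max(len(b) for b in buckets) == len(items):
        return sorted(items)              # degenerate pivots (e.g. all duplicates)
    result = []
    for bucket in buckets:
        result.extend(distribution_sort(bucket, block_size, num_pivots))
    return result
```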

There is a duality, or fundamental similarity, between merge- and distribution-based algorithms.[8]

Performance

The Sort Benchmark, created by computer scientist Jim Gray, compares external sorting algorithms implemented using finely tuned hardware and software. Winning implementations rely on a combination of hardware and software tuning techniques.

References

  1. Donald Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998, ISBN 0-201-89685-0, Section 5.4: External Sorting, pp. 248–379.
  2. Ellis Horowitz and Sartaj Sahni, Fundamentals of Data Structures, H. Freeman & Co., ISBN 0-7167-8042-9.
  3. Donald Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998, ISBN 0-201-89685-0, Section 5.4: External Sorting, pp. 254 ff.
  4. One way to see this is that given a fixed amount of memory (say, 1 GB) and a minimum read size (say, 2 MB), each merge pass can merge a certain number of runs (such as 500) into one, creating a divide-and-conquer situation similar to in-memory merge sort.
  5. For an example, assume 500 GB of data to sort, 1 GB of buffer memory, and a single disk with a 200 MB/s transfer rate and a 20 ms seek time. A single 500-way merge phase will use buffers of 2 MB each and need to do about 250,000 seeks while reading and then writing 500 GB. It will spend about 5,000 seconds seeking and 5,000 seconds transferring. Doing two merge passes as described above would nearly eliminate the seek time but add an additional 5,000 seconds of data transfer time, so this is approximately the break-even point between a two-pass and three-pass sort.
  6. Chris Nyberg, Mehul Shah, Sort Benchmark Home Page (links to examples of parallel sorts)
  7. Aggarwal, Alok; Vitter, Jeffrey (1988). "The input/output complexity of sorting and related problems" (PDF). Communications of the ACM. 31 (9): 1116–1127. doi:10.1145/48529.48535.
  8. J. S. Vitter, Algorithms and Data Structures for External Memory, Series on Foundations and Trends in Theoretical Computer Science, now Publishers, Hanover, MA, 2008, ISBN 978-1-60198-106-6.
  9. Nikolas Askitis, OzSort 2.0: Sorting up to 252GB for a Penny
  10. Rasmussen et al., TritonSort