Polyphase merge sort

Last updated

A polyphase merge sort is a variation of a bottom-up merge sort that sorts a list using an initial uneven distribution of sub-lists (runs), primarily used for external sorting, and is more efficient than an ordinary merge sort when there are fewer than eight external working files (such as a tape drive or a file on a hard drive). A polyphase merge sort is not a stable sort.

Contents

Ordinary merge sort

A merge sort splits the records of a dataset into sorted runs of records and then repeatedly merges sorted runs into larger sorted runs until only one run, the sorted dataset, remains.

An ordinary merge sort using four working files organizes them as a pair of input files and a pair of output files. The dataset is distributed evenly between two of the working files, either as sorted runs or in the simplest case, single records, which can be considered to be sorted runs of size 1. Once all of the dataset is transferred to the two working files, those two working files become the input files for the first merge iteration. Each merge iteration merges runs from the two input working files, alternating the merged output between the two output files, again distributing the merged runs evenly between the two output files (until the final merge iteration). Once all of the runs from the two inputs files are merged and output, then the output files become the input files and vice versa for the next merge iteration. The number of runs decreases by a factor of 2 at each iteration, such as 64, 32, 16, 8, 4, 2, 1. For the final merge iteration, the two input files only have one sorted run (1/2 of the dataset) each, and the merged result is a single sorted run (the sorted dataset) on one of the output files. This is also described at Merge sort § Use with tape drives.

If there are only three working files, then an ordinary merge sort merges sorted runs from two working files onto a single working file, then distributes the runs evenly between the two output files. The merge iteration reduces run count by a factor of 2, the redistribute iteration doesn't reduce run count (the factor is 1). Each iteration could be considered to reduce the run count by an average factor of 2 ≈ 1.41. If there are 5 working files, then the pattern alternates between a 3 way merge and a 2 way merge, for an average factor of 6 ≈ 2.45.

In general, for an even number N of working files, each iteration of an ordinary merge sort reduces run count by a factor of N/2, while for an odd number N of working files, each iteration reduces the run count by an average factor of (N2−1)/4 = N2−1/2.

Polyphase merge

For N< 8 working files, a polyphase merge sort achieves a higher effective run count reduction factor by unevenly distributing sorted runs between N−1 working files (explained in next section). Each iteration merges runs from N−1 working files onto a single output working file. When the end of one of the N−1 working files is reached, then it becomes the new output file and what was the output file becomes one of the N−1 working input files, starting a new iteration of polyphase merge sort. Each iteration merges only a fraction of the dataset (about 1/2 to 3/4), except for the last iteration which merges all of the dataset into a single sorted run. The initial distribution is set up so that only one input working file is emptied at a time, except for the final merge iteration which merges N−1 single runs (of varying size, this is explained next) from the N−1 input working files to the single output file, resulting in a single sorted run, the sorted dataset.

For each polyphase iteration, the total number of runs follows a pattern similar to a reversed Fibonacci numbers of higher order sequence. With 4 files, and a dataset consisting of 57 runs, the total run count on each iteration would be 57, 31, 17, 9, 5, 3, 1. [1] [2] Note that except for the last iteration, the run count reduction factor is a bit less than 2, 57/31, 31/17, 17/9, 9/5, 5/3, 3/1, about 1.84 for a 4 file case, but each iteration except the last reduced the run count while processing about 65% of the dataset, so the run count reduction factor per dataset processed during the intermediate iterations is about 1.84 / 0.65 = 2.83. For a dataset consisting of 57 runs of 1 record each, then after the initial distribution, polyphase merge sort moves 232 records during the 6 iterations it takes to sort the dataset, for an overall reduction factor of 2.70 (this is explained in more detail later).

After the first polyphase iteration, what was the output file now contains the results of merging N−1 original runs, but the remaining N−2 input working files still contain the remaining original runs, so the second merge iteration produces runs of size (N−1) + (N−2) = (2N − 3) original runs. The third iteration produces runs of size (4N − 7) original runs. With 4 files, the first iteration creates runs of size 3 original runs, the second iteration 5 original runs, the third iteration 9 original runs and so on, following the Fibonacci like pattern, 1, 3, 5, 9, 17, 31, 57, ... , so the increase in run size follows the same pattern as the decrease in run count in reverse. In the example case of 4 files and 57 runs of 1 record each, the last iteration merges 3 runs of size 31, 17, 9, resulting in a single sorted run of size 31+17+9 = 57 records, the sorted dataset. An example of the run counts and run sizes for 4 files, 31 records can be found in table 4.3 of. [3]

Perfect 3 file polyphase merge sort

It is easiest to look at the polyphase merge starting from its ending conditions and working backwards. At the start of each iteration, there will be two input files and one output file. At the end of the iteration, one input file will have been completely consumed and will become the output file for the next iteration. The current output file will become an input file for the next iteration. The remaining files (just one in the 3 file case) have only been partially consumed and their remaining runs will be input for the next iteration.

File 1 just emptied and became the new output file. One run is left on each input tape, and merging those runs together will make the sorted file.

File 1 (out):                                           <1 run> *        (the sorted file) File 2 (in ): ... | <1 run> *               -->     ... <1 run> | *          (consumed) File 3 (in ):     | <1 run> *                           <1 run> | *          (consumed)  ...  possible runs that have already been read |    marks the read pointer of the file *    marks end of file 

Stepping back to the previous iteration, we were reading from 1 and 2. One run is merged from 1 and 2 before file 1 goes empty. Notice that file 2 is not completely consumed—it has one run left to match the final merge (above).

File 1 (in ): ... | <1 run> *                      ... <1 run> | * File 2 (in ):     | <2 run> *           -->            <1 run> | <1 run> * File 3 (out):                                          <1 run> * 

Stepping back another iteration, 2 runs are merged from 1 and 3 before file 3 goes empty.

File 1 (in ):     | <3 run>                        ... <2 run> | <1 run> * File 2 (out):                               -->        <2 run> * File 3 (in ): ... | <2 run> *                          <2 run> | * 

Stepping back another iteration, 3 runs are merged from 2 and 3 before file 2 goes empty.

File 1 (out):                                          <3 run> * File 2 (in ): ... | <3 run> *               -->    ... <3 run> | * File 3 (in ):     | <5 run> *                          <3 run> | <2 run> * 

Stepping back another iteration, 5 runs are merged from 1 and 2 before file 1 goes empty.

File 1 (in ): ... | <5 run> *                      ... <5 run> | * File 2 (in ):     | <8 run> *               -->        <5 run> | <3 run> * File 3 (out):                                          <5 run> * 

Distribution for polyphase merge sort

Looking at the perfect 3 file case, the number of runs for merged working backwards: 1, 1, 2, 3, 5, ... reveals a Fibonacci sequence. The sequence for more than 3 files is a bit more complicated; for 4 files, starting at the final state and working backwards, the run count pattern is {1,0,0,0}, {0,1,1,1}, {1,0,2,2}, {3,2,0,4}, {7,6,4,0}, {0,13,11,7}, {13,0,24,20}, ... .

For everything to work out optimally, the last merge phase should have exactly one run on each input file. If any input file has more than one run, then another phase would be required. Consequently, the polyphase merge sort needs to be clever about the initial distribution of the input data's runs to the initial output files. For example, an input file with 13 runs would write 5 runs to file 1 and 8 runs to file 2.

In practice, the input file will not have the exact number of runs needed for a perfect distribution. One way to deal with this is by padding the actual distribution with imaginary "dummy runs" to simulate an ideal run distribution. [1] A dummy run behaves like a run with no records in it. Merging one or more dummy runs with one or more real runs just merges the real runs, and merging one or more dummy runs with no real runs results in a single dummy run. Another approach is to emulate dummy runs as needed during the merge operations. [4]

"Optimal" distribution algorithms require knowing the number of runs in advance. Otherwise, in the more common case where the number of runs is not known in advance, "near optimal" distribution algorithms are used. Some distribution algorithms include rearranging runs. [5] If the number of runs is known in advance, only a partial distribution is needed before starting the merge phases. For example, consider the 3 file case, starting with n runs in File_1. Define Fi = Fi−1 + Fi−2 as the ith Fibonacci number. If n = Fi, then move Fi−2 runs to File_2, leaving Fi−1 runs remaining on File_1, a perfect run distribution. If Fi < n < Fi+1, move nFi runs to File_2 and Fi+1n runs to File_3. The first merge iteration merges nFi runs from File_1 and File_2, appending the nFi merged runs to the Fi+1n runs already moved to File_3. File_1 ends up with Fi−2 runs remaining, File_2 is emptied, and File_3 ends up with Fi−1 runs, again a perfect run distribution. For 4 or more files, the math is more complicated, but the concept is the same.

Comparison versus ordinary merge sort

After the initial distribution, an ordinary merge sort using 4 files will sort 16 single record runs in 4 iterations of the entire dataset, moving a total of 64 records in order to sort the dataset after the initial distribution. A polyphase merge sort using 4 files will sort 17 single record runs in 4 iterations, but since each iteration but the last iteration only moves a fraction of the dataset, it only moves a total of 48 records in order to sort the dataset after the initial distribution. In this case, ordinary merge sort factor is 2.0, while polyphase overall factor is ≈2.73.

To explain how the reduction factor is related to sort performance, the reduction factor equations are:

reduction_factor = exp(number_of_runs*log(number_of_runs)/run_move_count) run_move_count = number_of_runs * log(number_of_runs)/log(reduction_factor) run_move_count = number_of_runs * log_reduction_factor(number_of_runs)

Using the run move count equation for the above examples:

Here is a table of effective reduction factors for polyphase and ordinary merge sort listed by number of files, based on actual sorts of a few million records. This table roughly corresponds to the reduction factor per dataset moved tables shown in fig 3 and fig 4 of polyphase merge sort.pdf

# files |     average fraction of data per iteration |     |     polyphase reduction factor on ideal sized data |     |     |     ordinary reduction factor on ideal sized data |     |     |     | 3     .73   1.94  1.41  (sqrt  2) 4     .63   2.68  2.00 5     .58   3.20  2.45  (sqrt  6) 6     .56   3.56  3.00 7     .55   3.80  3.46  (sqrt 12) 8     .54   3.95  4.00 9     .53   4.07  4.47  (sqrt 20) 10    .53   4.15  5.00 11    .53   4.22  5.48  (sqrt 30) 12    .53   4.28  6.00 32    .53   4.87 16.00 

In general, polyphase merge sort is better than ordinary merge sort when there are fewer than 8 files, while ordinary merge sort starts to become better at around 8 or more files. [6] [7]

Related Research Articles

<span class="mw-page-title-main">Merge sort</span> Divide and conquer-based sorting algorithm

In computer science, merge sort is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the relative order of equal elements is the same in the input and output. Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948.

In computer science, radix sort is a non-comparative sorting algorithm. It avoids comparison by creating and distributing elements into buckets according to their radix. For elements with more than one significant digit, this bucketing process is repeated for each digit, while preserving the ordering of the prior step, until all digits have been considered. For this reason, radix sort has also been called bucket sort and digital sort.

<span class="mw-page-title-main">Sorting algorithm</span> Algorithm that arranges lists in order

In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is important for optimizing the efficiency of other algorithms that require input data to be in sorted lists. Sorting is also often useful for canonicalizing data and for producing human-readable output.

<span class="mw-page-title-main">Jackson structured programming</span>

Jackson structured programming (JSP) is a method for structured programming developed by British software consultant Michael A. Jackson and described in his 1975 book Principles of Program Design. The technique of JSP is to analyze the data structures of the files that a program must read as input and produce as output, and then produce a program design based on those data structures, so that the program control structure handles those data structures in a natural and intuitive way.

<span class="mw-page-title-main">Bucket sort</span> Sorting algorithm

Bucket sort, or bin sort, is a sorting algorithm that works by distributing the elements of an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. It is a distribution sort, a generalization of pigeonhole sort that allows multiple keys per bucket, and is a cousin of radix sort in the most-to-least significant digit flavor. Bucket sort can be implemented with comparisons and therefore can also be considered a comparison sort algorithm. The computational complexity depends on the algorithm used to sort each bucket, the number of buckets to use, and whether the input is uniformly distributed.

In computer science, counting sort is an algorithm for sorting a collection of objects according to keys that are small positive integers; that is, it is an integer sorting algorithm. It operates by counting the number of objects that possess distinct key values, and applying prefix sum on those counts to determine the positions of each key value in the output sequence. Its running time is linear in the number of items and the difference between the maximum key value and the minimum key value, so it is only suitable for direct use in situations where the variation in keys is not significantly greater than the number of items. It is often used as a subroutine in radix sort, another sorting algorithm, which can handle larger keys more efficiently.

In computer science, an in-place algorithm is an algorithm that operates directly on the input data structure without requiring extra space proportional to the input size. In other words, it modifies the input in place, without creating a separate copy of the data structure. An algorithm which is not in-place is sometimes called not-in-place or out-of-place.

<span class="mw-page-title-main">Time complexity</span> Estimate of time taken for running an algorithm

In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, supposing that each elementary operation takes a fixed amount of time to perform. Thus, the amount of time taken and the number of elementary operations performed by the algorithm are taken to be related by a constant factor.

A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random determined by the random bits; thus either the running time, or the output are random variables.

rzip is a huge-scale data compression computer program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding (Huffman) on 900 kB output chunks.

This article discusses support programs included in or available for OS/360 and successors. IBM categorizes some of these programs as utilities and others as service aids; the boundaries are not always consistent or obvious. Many, but not all, of these programs match the types in utility software.

External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device and instead they must reside in the slower external memory, usually a disk drive. Thus, external sorting algorithms are external memory algorithms and thus applicable in the external memory model of computation.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

<span class="mw-page-title-main">Bitonic sorter</span> Parallel sorting algorithm

Bitonic mergesort is a parallel algorithm for sorting. It is also used as a construction method for building a sorting network. The algorithm was devised by Ken Batcher. The resulting sorting networks consist of comparators and have a delay of , where is the number of items to be sorted. This makes it a popular choice for sorting large numbers of elements on an architecture which itself contains a large number of parallel execution units running in lockstep, such as a typical GPU.

The Sort/Merge utility is a mainframe program to sort records in a file into a specified order, merge pre-sorted files into a sorted file, or copy selected records. Internally, these utilities use one or more of the standard sorting algorithms, often with proprietary fine-tuned code.

The expected linear time MST algorithm is a randomized algorithm for computing the minimum spanning forest of a weighted graph with no isolated vertices. It was developed by David Karger, Philip Klein, and Robert Tarjan. The algorithm relies on techniques from Borůvka's algorithm along with an algorithm for verifying a minimum spanning tree in linear time. It combines the design paradigms of divide and conquer algorithms, greedy algorithms, and randomized algorithms to achieve expected linear performance.

In cryptography, scrypt is a password-based key derivation function created by Colin Percival in March 2009, originally for the Tarsnap online backup service. The algorithm was specifically designed to make it costly to perform large-scale custom hardware attacks by requiring large amounts of memory. In 2016, the scrypt algorithm was published by IETF as RFC 7914. A simplified version of scrypt is used as a proof-of-work scheme by a number of cryptocurrencies, first implemented by an anonymous programmer called ArtForz in Tenebrix and followed by Fairbrix and Litecoin soon after.

In computer science, k-way merge algorithms or multiway merges are a specific type of sequence merge algorithms that specialize in taking in k sorted lists and merging them into a single sorted list. These merge algorithms generally refer to merge algorithms that take in a number of sorted lists greater than two. Two-way merges are also referred to as binary merges.The k- way merge also external sorting algorithm.

References

  1. 1 2 Donald Knuth, The Art of Computer Programming, Volume 3, Addison Wesley, 1973, Algorithm 5.4.2D.
  2. "Sorting and Searching Algorithms". Archived from the original on 2012-11-22. Retrieved 2010-01-31.
  3. "External sorting". Archived from the original on 2016-01-28. Retrieved 2016-01-22.
  4. https://www.fq.math.ca/Scanned/8-1/lynch.pdf [ bare URL PDF ]
  5. http://i.stanford.edu/pub/cstr/reports/cs/tr/76/543/CS-TR-76-543.pdf [ bare URL PDF ]
  6. "Advanced Programming I Lecture Notes".
  7. http://www.mif.vu.lt/~algis/dsax/DsSort.pdf [ bare URL PDF ]

Further reading