Polyphase merge sort

A polyphase merge sort is a variation of a bottom-up merge sort that sorts a list using an initial uneven distribution of sub-lists (runs), primarily used for external sorting. It is more efficient than an ordinary merge sort when there are fewer than eight external working files (such as tape drives or files on a hard drive). A polyphase merge sort is not a stable sort.

Balanced merge sort

A merge sort splits the records of a dataset into sorted runs of records and then repeatedly merges sorted runs into larger sorted runs until only one run, the sorted dataset, remains.

A ‘balanced’ merge sort using four working files organizes them as a pair of input files and a pair of output files. The dataset is distributed evenly between two of the working files, either as sorted runs or, in the simplest case, as single records, which can be considered sorted runs of size 1. Once all of the dataset is transferred to the two working files, those two working files become the input files for the first merge iteration. Each merge iteration merges runs from the two input working files, alternating the merged output between the two output files, again distributing the merged runs evenly between the two output files (until the final merge iteration). Once all of the runs from the two input files are merged and output, the output files become the input files and vice versa for the next merge iteration. The number of runs decreases by a factor of 2 at each iteration, such as 64, 32, 16, 8, 4, 2, 1. For the final merge iteration, the two input files each hold only one sorted run (half of the dataset), and the merged result is a single sorted run (the sorted dataset) on one of the output files. This is also described at Merge sort § Use with tape drives.
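
The following Python sketch illustrates this balanced scheme, with in-memory lists of sorted runs standing in for the four working files; the function names (balanced_merge_iteration, balanced_merge_sort) are illustrative, not standard terminology, and a non-empty input is assumed.

    import heapq

    def balanced_merge_iteration(in_a, in_b, out_a, out_b):
        # One balanced merge pass: repeatedly merge one run from each input
        # "file" (a list of sorted runs), alternating the merged output
        # between the two output "files".
        turn = 0
        while in_a or in_b:
            run_a = in_a.pop(0) if in_a else []
            run_b = in_b.pop(0) if in_b else []
            merged = list(heapq.merge(run_a, run_b))   # merge two sorted runs
            (out_a if turn % 2 == 0 else out_b).append(merged)
            turn += 1

    def balanced_merge_sort(records):
        # Four "files": files[2] and files[3] start with the single-record
        # runs, files[0] and files[1] start empty.
        files = [[], [], [[r] for r in records[0::2]], [[r] for r in records[1::2]]]
        src, dst = (2, 3), (0, 1)
        while len(files[src[0]]) + len(files[src[1]]) > 1:
            balanced_merge_iteration(files[src[0]], files[src[1]],
                                     files[dst[0]], files[dst[1]])
            src, dst = dst, src                        # swap input and output roles
        return (files[src[0]] or files[src[1]])[0]

    print(balanced_merge_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))   # [0, 1, ..., 9]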

If there are only three working files, then a balanced merge sort merges sorted runs from two working files onto the single remaining working file, then distributes the runs evenly back onto the two now-empty working files. The merge iteration reduces the run count by a factor of 2, while the redistribute iteration doesn't reduce the run count (the factor is 1). Each iteration could be considered to reduce the run count by an average factor of √2 ≈ 1.41. If there are 5 working files, then the pattern alternates between a 3 way merge and a 2 way merge, for an average factor of √6 ≈ 2.45.

In general, for an even number N of working files, each iteration of a balanced merge sort reduces the run count by a factor of N/2, while for an odd number N of working files, each iteration reduces the run count by an average factor of √((N²−1)/4) = √(N²−1)/2.
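
As a quick check of these factors, the short sketch below (the name balanced_factor is illustrative) evaluates N/2 for even N and √((N²−1)/4) for odd N, reproducing the balanced column of the table later in this article.

    from math import sqrt

    def balanced_factor(n_files):
        # Average run-count reduction per iteration of a balanced merge sort
        # on n_files working files (n_files >= 3).
        if n_files % 2 == 0:
            return n_files / 2                      # even N: an N/2-way merge every pass
        return sqrt((n_files * n_files - 1) / 4)    # odd N: geometric mean of the
                                                    # alternating (N+1)/2 and (N-1)/2 way merges

    for n in range(3, 9):
        print(n, round(balanced_factor(n), 2))
    # 3 1.41, 4 2.0, 5 2.45, 6 3.0, 7 3.46, 8 4.0 (matches the table further below)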

Polyphase merge

For N < 8 working files, a polyphase merge sort achieves a higher effective run count reduction factor by unevenly distributing sorted runs between N−1 working files (explained in the next section). Each iteration merges runs from N−1 working files onto a single output working file. When the end of one of the N−1 working files is reached, it becomes the new output file and what was the output file becomes one of the N−1 input working files, starting a new iteration of polyphase merge sort. Each iteration merges only a fraction of the dataset (about 1/2 to 3/4), except for the last iteration, which merges all of the dataset into a single sorted run. The initial distribution is set up so that only one input working file is emptied at a time, except for the final merge iteration, which merges N−1 single runs (of varying size; this is explained next) from the N−1 input working files to the single output file, resulting in a single sorted run, the sorted dataset.
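
A minimal sketch of this process, assuming in-memory lists of sorted runs stand in for the working files and that the initial run counts already form a perfect distribution (the name polyphase_merge_sort is illustrative):

    import heapq

    def polyphase_merge_sort(distribution):
        # Polyphase merge on len(distribution)+1 "files" (lists of sorted runs).
        # 'distribution' holds the input files with a perfect initial run
        # distribution; one empty output file is added internally.
        files = [list(f) for f in distribution] + [[]]   # last list is the output file
        out = len(files) - 1
        while sum(len(f) for f in files) > 1:
            inputs = [i for i in range(len(files)) if i != out]
            # one iteration: merge runs until the smallest input file is exhausted
            for _ in range(min(len(files[i]) for i in inputs)):
                runs = [files[i].pop(0) for i in inputs]
                files[out].append(list(heapq.merge(*runs)))
            # the emptied input becomes the next output file
            out = next(i for i in inputs if not files[i])
        return next(f for f in files if f)[0]

    # 5 single-record runs split 3/2 across two input files (the 3 file case)
    print(polyphase_merge_sort([[[4], [1], [5]], [[3], [2]]]))   # [1, 2, 3, 4, 5]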

For each polyphase iteration, the total number of runs follows a pattern similar to a reversed higher-order Fibonacci sequence. With 4 files and a dataset consisting of 57 runs, the total run count on each iteration would be 57, 31, 17, 9, 5, 3, 1. [1] [2] Note that except for the last iteration, the run count reduction factor is a bit less than 2 (57/31, 31/17, 17/9, 9/5, 5/3, 3/1, about 1.84 for the 4 file case), but each iteration except the last reduces the run count while processing only about 65% of the dataset, so the run count reduction factor per dataset processed during the intermediate iterations is about 1.84 / 0.65 = 2.83. For a dataset consisting of 57 runs of 1 record each, after the initial distribution, polyphase merge sort moves 232 records during the 6 iterations it takes to sort the dataset, for an overall reduction factor of 2.70 (this is explained in more detail later).
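
The run counts and record counts quoted above can be reproduced with a small simulation that tracks only run sizes (the name polyphase_counts is illustrative; the 24/20/13 starting split is the perfect 4 file distribution of 57 runs, which also appears in the distribution section below):

    def polyphase_counts(run_lists):
        # Simulate the iterations of a polyphase merge, tracking only run sizes.
        # run_lists: one list of run sizes per input file (a perfect distribution).
        files = [list(f) for f in run_lists] + [[]]      # the extra list is the output file
        moved, out = 0, len(files) - 1
        while sum(len(f) for f in files) > 1:
            inputs = [i for i in range(len(files)) if i != out]
            for _ in range(min(len(files[i]) for i in inputs)):
                merged = sum(files[i].pop(0) for i in inputs)
                files[out].append(merged)
                moved += merged                          # every merged record is moved once
            print("runs left:", sum(len(f) for f in files), " records moved:", moved)
            out = next(i for i in inputs if not files[i])
        return moved

    # 57 single-record runs in the perfect 4 file split 24/20/13
    print(polyphase_counts([[1] * 24, [1] * 20, [1] * 13]))
    # prints run counts 31, 17, 9, 5, 3, 1 and a total of 232 records moved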

After the first polyphase iteration, what was the output file now contains the results of merging N−1 original runs, but the remaining N−2 input working files still contain the remaining original runs, so the second merge iteration produces runs of size (N−1) + (N−2) = 2N−3 original runs. The third iteration produces runs of size 4N−7 original runs. With 4 files, the first iteration creates runs of size 3 original runs, the second iteration runs of size 5, the third iteration runs of size 9, and so on, following the Fibonacci-like pattern 1, 3, 5, 9, 17, 31, 57, ..., so the increase in run size follows the same pattern as the decrease in run count, in reverse. In the example case of 4 files and 57 runs of 1 record each, the last iteration merges 3 runs of sizes 31, 17 and 9, resulting in a single sorted run of 31+17+9 = 57 records, the sorted dataset. An example of the run counts and run sizes for 4 files and 31 records can be found in table 4.3 of [3].

Perfect 3 file polyphase merge sort

It is easiest to look at the polyphase merge starting from its ending conditions and working backwards. At the start of each iteration, there will be two input files and one output file. At the end of the iteration, one input file will have been completely consumed and will become the output file for the next iteration. The current output file will become an input file for the next iteration. The remaining files (just one in the 3 file case) have only been partially consumed and their remaining runs will be input for the next iteration.

File 1 just emptied and became the new output file. One run is left on each input tape, and merging those runs together will make the sorted file.

File 1 (out):                                           <1 run> *        (the sorted file)
File 2 (in ): ... | <1 run> *               -->     ... <1 run> | *          (consumed)
File 3 (in ):     | <1 run> *                           <1 run> | *          (consumed)

...  possible runs that have already been read
|    marks the read pointer of the file
*    marks end of file

Stepping back to the previous iteration, we were reading from 1 and 2. One run is merged from 1 and 2 before file 1 goes empty. Notice that file 2 is not completely consumed—it has one run left to match the final merge (above).

File 1 (in ): ... | <1 run> *                      ... <1 run> | *
File 2 (in ):     | <2 run> *           -->            <1 run> | <1 run> *
File 3 (out):                                          <1 run> *

Stepping back another iteration, 2 runs are merged from 1 and 3 before file 3 goes empty.

File 1 (in ):     | <3 run> *                      ... <2 run> | <1 run> *
File 2 (out):                               -->        <2 run> *
File 3 (in ): ... | <2 run> *                          <2 run> | *

Stepping back another iteration, 3 runs are merged from 2 and 3 before file 2 goes empty.

File 1 (out):                                          <3 run> *
File 2 (in ): ... | <3 run> *               -->    ... <3 run> | *
File 3 (in ):     | <5 run> *                          <3 run> | <2 run> *

Stepping back another iteration, 5 runs are merged from 1 and 2 before file 1 goes empty.

File 1 (in ): ... | <5 run> *                      ... <5 run> | *
File 2 (in ):     | <8 run> *               -->        <5 run> | <3 run> *
File 3 (out):                                          <5 run> *

Distribution for polyphase merge sort

Looking at the perfect 3 file case, the number of runs merged at each iteration, working backwards (1, 1, 2, 3, 5, ...), reveals the Fibonacci sequence. The sequence for more than 3 files is a bit more complicated; for 4 files, starting at the final state and working backwards, the run count pattern is {1,0,0,0}, {0,1,1,1}, {1,0,2,2}, {3,2,0,4}, {7,6,4,0}, {0,13,11,7}, {13,0,24,20}, ... .
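
The backwards pattern above can be generated in a few lines, assuming the output file rotates cyclically through the files, which holds for perfect distributions (the name backward_distributions is illustrative):

    def backward_distributions(n_files, n_phases):
        # Generate perfect polyphase run distributions, working backwards from
        # the final state (one sorted run on file 1).
        state = [1] + [0] * (n_files - 1)
        states = [tuple(state)]
        for phase in range(n_phases):
            o = phase % n_files              # file that was the output of the undone phase
            m = state[o]                     # runs it produced = merge steps undone
            state = [0 if i == o else c + m for i, c in enumerate(state)]
            states.append(tuple(state))
        return states

    print(backward_distributions(4, 6))
    # yields (1,0,0,0), (0,1,1,1), (1,0,2,2), (3,2,0,4), (7,6,4,0), (0,13,11,7), (13,0,24,20)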

For everything to work out optimally, the last merge phase should have exactly one run on each input file. If any input file has more than one run, then another phase would be required. Consequently, the polyphase merge sort needs to be clever about the initial distribution of the input data's runs to the initial output files. For example, an input file with 13 runs would write 5 runs to file 1 and 8 runs to file 2.

In practice, the input file will not have the exact number of runs needed for a perfect distribution. One way to deal with this is by padding the actual distribution with imaginary "dummy runs" to simulate an ideal run distribution. [1] A dummy run behaves like a run with no records in it. Merging one or more dummy runs with one or more real runs just merges the real runs, and merging one or more dummy runs with no real runs results in a single dummy run. Another approach is to emulate dummy runs as needed during the merge operations. [4]

"Optimal" distribution algorithms require knowing the number of runs in advance. Otherwise, in the more common case where the number of runs is not known in advance, "near optimal" distribution algorithms are used. Some distribution algorithms include rearranging runs. [5] If the number of runs is known in advance, only a partial distribution is needed before starting the merge phases. For example, consider the 3 file case, starting with n runs in File_1. Define Fi = Fi−1 + Fi−2 as the ith Fibonacci number. If n = Fi, then move Fi−2 runs to File_2, leaving Fi−1 runs remaining on File_1, a perfect run distribution. If Fi < n < Fi+1, move nFi runs to File_2 and Fi+1n runs to File_3. The first merge iteration merges nFi runs from File_1 and File_2, appending the nFi merged runs to the Fi+1n runs already moved to File_3. File_1 ends up with Fi−2 runs remaining, File_2 is emptied, and File_3 ends up with Fi−1 runs, again a perfect run distribution. For 4 or more files, the math is more complicated, but the concept is the same.

Comparison versus balanced merge sort

After the initial distribution, a balanced merge sort using 4 files will sort 16 single-record runs in 4 iterations over the entire dataset, moving a total of 64 records in order to sort the dataset. A polyphase merge sort using 4 files will sort 17 single-record runs in 4 iterations, but since every iteration except the last moves only a fraction of the dataset, it moves a total of only 48 records in order to sort the dataset after the initial distribution. In this case, the balanced merge sort reduction factor is 2.0, while the polyphase overall factor is about 2.73.

To explain how the reduction factor is related to sort performance, the reduction factor equations are:

reduction_factor = exp(number_of_runs * log(number_of_runs) / run_move_count)
run_move_count   = number_of_runs * log(number_of_runs) / log(reduction_factor)
run_move_count   = number_of_runs * log_reduction_factor(number_of_runs)

Using the run move count equation for the above examples:
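
A worked substitution for the 4 file examples above, using the reduction factors 2.0 and 2.73 from the preceding paragraph:

balanced:  run_move_count = 16 * log_2(16)    = 16 * 4    = 64 records moved
polyphase: run_move_count = 17 * log_2.73(17) ≈ 17 * 2.83 ≈ 48 records moved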

Here is a table of effective reduction factors for polyphase and balanced merge sort, listed by number of files, based on actual sorts of a few million records. This table roughly corresponds to the reduction factor per dataset moved tables shown in fig 3 and fig 4 of polyphase merge sort.pdf.

# files   average fraction of      polyphase reduction factor   balanced reduction factor
          data per iteration       on ideal sized data          on ideal sized data
   3            .73                        1.94                     1.41  (sqrt  2)
   4            .63                        2.68                     2.00
   5            .58                        3.20                     2.45  (sqrt  6)
   6            .56                        3.56                     3.00
   7            .55                        3.80                     3.46  (sqrt 12)
   8            .54                        3.95                     4.00
   9            .53                        4.07                     4.47  (sqrt 20)
  10            .53                        4.15                     5.00
  11            .53                        4.22                     5.48  (sqrt 30)
  12            .53                        4.28                     6.00
  32            .53                        4.87                    16.00

In general, polyphase merge sort is better than balanced merge sort when there are fewer than 8 files, while balanced merge sort starts to become better at around 8 or more files. [6] [7]

References

  1. Donald Knuth, The Art of Computer Programming, Volume 3, Addison Wesley, 1973, Algorithm 5.4.2D.
  2. "Sorting and Searching Algorithms". Archived from the original on 2012-11-22. Retrieved 2010-01-31.
  3. "External sorting". Archived from the original on 2016-01-28. Retrieved 2016-01-22.
  4. https://www.fq.math.ca/Scanned/8-1/lynch.pdf
  5. http://i.stanford.edu/pub/cstr/reports/cs/tr/76/543/CS-TR-76-543.pdf
  6. "Advanced Programming I Lecture Notes". Archived from the original on 2016-01-27. Retrieved 2016-01-14.
  7. http://www.mif.vu.lt/~algis/dsax/DsSort.pdf
