Ordinary merge sort

A merge sort repeatedly merges sorted runs into larger sorted runs until only one run, the sorted dataset, remains. An ordinary (balanced) merge sort with ''N'' working files splits them into ''N''/2 input files and ''N''/2 output files; each iteration merges runs from the input files onto the output files and then swaps the two groups, so every iteration moves the entire dataset and reduces the run count by a factor of ''N''/2.
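As a rough illustration, the following Python sketch tracks only run counts and records moved for a balanced 4-file sort of 16 single record runs; the variable names are illustrative, and the totals match the comparison section below.

<syntaxhighlight lang="python">
# Run-count sketch of an ordinary (balanced) merge sort on 4 working
# files: 2 inputs and 2 outputs, so each iteration is a 2-way merge
# that rewrites the entire dataset and halves the number of runs.
runs, records, moved = 16, 16, 0
while runs > 1:
    runs = (runs + 1) // 2    # one full pass over the data
    moved += records          # every record is moved each iteration
print(runs, moved)            # -> 1 run, 64 records moved (4 iterations)
</syntaxhighlight>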
Polyphase merge
For ''N'' < 8 working files, a polyphase merge sort achieves a higher effective run count reduction factor by unevenly distributing sorted runs between ''N''−1 working files (explained in the next section). Each iteration merges runs from ''N''−1 working files onto a single output working file. When the end of one of the ''N''−1 working files is reached, it becomes the new output file and what was the output file becomes one of the ''N''−1 input working files, starting a new iteration of polyphase merge sort. Each iteration merges only a fraction of the dataset (about 1/2 to 3/4), except for the last iteration, which merges all of the dataset into a single sorted run. The initial distribution is set up so that only one input working file is emptied at a time, except for the final merge iteration, which merges ''N''−1 single runs (of varying size, as explained below) from the ''N''−1 input working files to the single output file, resulting in a single sorted run, the sorted dataset. For each polyphase iteration, the total number of runs follows a pattern similar to a reversed higher-order Fibonacci sequence. With 4 files and a dataset consisting of 57 runs, the total run count on each iteration would be 57, 31, 17, 9, 5, 3, 1.
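This behavior can be traced with a short simulation that tracks only per-file run counts. The following is a minimal Python sketch rather than a real external sort; the function name is illustrative, and a perfect initial distribution of 24, 20, and 13 runs (summing to 57) is assumed so that the totals reproduce the sequence just given.

<syntaxhighlight lang="python">
# Tracks run counts during a polyphase merge; runs[i] is the number of
# sorted runs on working file i, and exactly one file (the current
# output) holds 0 runs at each iteration boundary.
def polyphase_run_counts(runs):
    print(f"start: {runs}, total = {sum(runs)}")
    while sum(runs) > 1:
        out = runs.index(0)                      # current output file
        merged = min(r for r in runs if r > 0)   # merges until an input empties
        runs = [r - merged for r in runs]        # consume runs from each input
        runs[out] = merged                       # output receives the merged runs
        print(f"after iteration: {runs}, total = {sum(runs)}")
    return runs

# Perfect 4-file distribution of 57 runs; the printed totals are
# 57, 31, 17, 9, 5, 3, 1, matching the sequence above.
polyphase_run_counts([24, 20, 13, 0])
</syntaxhighlight>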
Perfect 3 file polyphase merge sort

It is easiest to look at the polyphase merge starting from its ending conditions and working backwards. At the start of each iteration, there will be two input files and one output file. At the end of the iteration, one input file will have been completely consumed and will become the output file for the next iteration. The current output file will become an input file for the next iteration. The remaining files (just one in the 3 file case) have only been partially consumed and their remaining runs will be input for the next iteration.

File 1 just emptied and became the new output file. One run is left on each input tape, and merging those runs together will make the sorted file.

 File 1 (out):                                  <1 run> *        (the sorted file)
 File 2 (in ): ... | <1 run> *     -->  ... <1 run> | *          (consumed)
 File 3 (in ):     | <1 run> *              <1 run> | *          (consumed)

 ...  possible runs that have already been read
 |    marks the read pointer of the file
 *    marks end of file

Stepping back to the previous iteration, we were reading from files 1 and 2. One run is merged from files 1 and 2 before file 1 goes empty. Notice that file 2 is not completely consumed; it has one run left to match the final merge (above).
 File 1 (in ): ... | <1 run> *          ... <1 run> | *
 File 2 (in ):     | <2 run> *     -->      <1 run> | <1 run> *
 File 3 (out):                              <1 run> *

Stepping back another iteration, 2 runs are merged from files 1 and 3 before file 3 goes empty.
 File 1 (in ):     | <3 run> *              <2 run> | <1 run> *
 File 2 (out):                     -->      <2 run> *
 File 3 (in ): ... | <2 run> *          ... <2 run> | *

Stepping back another iteration, 3 runs are merged from files 2 and 3 before file 2 goes empty.
 File 1 (out):                              <3 run> *
 File 2 (in ): ... | <3 run> *     -->  ... <3 run> | *
 File 3 (in ):     | <5 run> *              <3 run> | <2 run> *

Stepping back another iteration, 5 runs are merged from files 1 and 2 before file 1 goes empty.
 File 1 (in ): ... | <5 run> *          ... <5 run> | *
 File 2 (in ):     | <8 run> *     -->      <5 run> | <3 run> *
 File 3 (out):                              <5 run> *
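The backward stepping shown in these diagrams is mechanical: at each earlier iteration the two input files hold consecutive Fibonacci numbers of runs. A minimal Python sketch (the helper name is illustrative):

<syntaxhighlight lang="python">
# Steps the perfect 3 file case backwards from the final merge: undoing
# one iteration turns input run counts (a, b) into (b, a + b).
def step_backwards(iterations):
    a, b = 1, 1               # one run on each input at the final merge
    for _ in range(iterations):
        print(f"inputs hold {a} and {b} runs, total = {a + b}")
        a, b = b, a + b       # undo one merge iteration

step_backwards(5)             # totals printed: 2, 3, 5, 8, 13
</syntaxhighlight>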
Distribution for polyphase merge sort
Looking at the perfect 3 file case, the number of runs merged, working backwards: 1, 1, 2, 3, 5, ... reveals a Fibonacci sequence. The sequence for more than 3 files is a bit more complicated; for 4 files, starting at the final state and working backwards, the total run count pattern is 1, 3, 5, 9, 17, 31, 57, ... . For everything to work out optimally, the last merge phase should have exactly one run on each input file. If any input file has more than one run, then another phase would be required. Consequently, the polyphase merge sort needs to be clever about the initial distribution of the input data's runs to the initial output files. For example, an input file with 13 runs would write 5 runs to file 1 and 8 runs to file 2.

In practice, the input file will not have the exact number of runs needed for a perfect distribution. One way to deal with this is by padding the actual distribution with imaginary "dummy runs" to simulate an ideal run distribution. A dummy run behaves like a run with no records in it. Merging one or more dummy runs with one or more real runs just merges the real runs, and merging one or more dummy runs with no real runs results in a single dummy run. Another approach is to emulate dummy runs as needed during the merge operations.

"Optimal" distribution algorithms require knowing the number of runs in advance. Otherwise, in the more common case where the number of runs is not known in advance, "near optimal" distribution algorithms are used. Some distribution algorithms include rearranging runs. If the number of runs is known in advance, only a partial distribution is needed before starting the merge phases. For example, consider the 3 file case, starting with ''n'' runs in File_1. Define ''F''<sub>''i''</sub> = ''F''<sub>''i''−1</sub> + ''F''<sub>''i''−2</sub> as the ''i''th Fibonacci number.
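Those definitions suggest a simple distribution routine for the 3 file case: find the smallest pair of consecutive Fibonacci numbers whose sum is at least ''n'', then pad the shortfall with dummy runs. The following Python sketch is illustrative only (the function name and result layout are not from any published implementation):

<syntaxhighlight lang="python">
# Chooses a perfect Fibonacci distribution target for n real runs in
# the 3 file case, padding any deficit with imaginary dummy runs.
def distribute_with_dummies(n):
    a, b = 1, 1                       # consecutive Fibonacci numbers
    while a + b < n:
        a, b = b, a + b
    return {"file_1": a, "file_2": b, "dummy_runs": (a + b) - n}

print(distribute_with_dummies(13))    # {'file_1': 5, 'file_2': 8, 'dummy_runs': 0}
print(distribute_with_dummies(10))    # {'file_1': 5, 'file_2': 8, 'dummy_runs': 3}
</syntaxhighlight>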
Comparison versus ordinary merge sort

After the initial distribution, an ordinary merge sort using 4 files will sort 16 single record runs in 4 iterations of the entire dataset, moving a total of 64 records in order to sort the dataset. A polyphase merge sort using 4 files will sort 17 single record runs in 4 iterations, but since every iteration except the last moves only a fraction of the dataset, it moves a total of only 48 records in order to sort the dataset after the initial distribution. In this case, the ordinary merge sort factor is 2.0, while the polyphase overall factor is ≈ 2.73.

To explain how the reduction factor is related to sort performance, the reduction factor equations are:

 reduction_factor = exp(number_of_runs * log(number_of_runs) / run_move_count)
 run_move_count   = number_of_runs * log(number_of_runs) / log(reduction_factor)
 run_move_count   = number_of_runs * log_reduction_factor(number_of_runs)

Using the run move count equation for the above examples (see the sketch at the end of this section):
* ordinary merge sort → 16 × log<sub>2</sub>(16) = 64 records moved,
* polyphase merge sort → 17 × log<sub>2.73</sub>(17) ≈ 48 records moved.

Here is a table of effective reduction factors for polyphase and ordinary merge sort listed by number of files, based on actual sorts of a few million records. This table roughly corresponds to the reduction factor per dataset moved tables shown in fig 3 and fig 4 of the cited reference.

{| class="wikitable"
! # files !! average fraction of data per iteration !! polyphase reduction factor on ideal sized data !! ordinary reduction factor on ideal sized data
|-
| 3 || .73 || 1.94 || 1.41 (sqrt 2)
|-
| 4 || .63 || 2.68 || 2.00
|-
| 5 || .58 || 3.20 || 2.45 (sqrt 6)
|-
| 6 || .56 || 3.56 || 3.00
|-
| 7 || .55 || 3.80 || 3.46 (sqrt 12)
|-
| 8 || .54 || 3.95 || 4.00
|-
| 9 || .53 || 4.07 || 4.47 (sqrt 20)
|-
| 10 || .53 || 4.15 || 5.00
|-
| 11 || .53 || 4.22 || 5.48 (sqrt 30)
|-
| 12 || .53 || 4.28 || 6.00
|-
| 32 || .53 || 4.87 || 16.00
|}

In general, polyphase merge sort is better than ordinary merge sort when there are fewer than 8 files, while ordinary merge sort starts to become better at around 8 or more files.
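The run move counts in the two bullet points above can be checked directly against the equations; a small Python sketch:

<syntaxhighlight lang="python">
import math

# run_move_count = number_of_runs * log(number_of_runs) / log(reduction_factor)
def run_move_count(number_of_runs, reduction_factor):
    return number_of_runs * math.log(number_of_runs) / math.log(reduction_factor)

print(run_move_count(16, 2.0))    # ordinary merge sort: 64.0 records moved
print(run_move_count(17, 2.73))   # polyphase merge sort: ~48 records moved
</syntaxhighlight>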
References

* http://www.mif.vu.lt/~algis/dsax/DsSort.pdf
{{DEFAULTSORT:Polyphase merge sort}}
[[Category:Sorting algorithms]]
[[Category:Comparison sorts]]
[[Category:Online sorts]]