Median Of Medians

In computer science, the median of medians is an approximate (median) selection algorithm, frequently used to supply a good pivot for an exact selection algorithm, most commonly quickselect, which selects the ''k''th smallest element of an initially unsorted array. Median of medians finds an approximate median in linear time; this is of limited use on its own, but as a subroutine it adds only linear overhead to quickselect. When this approximate median is used as an improved pivot, the worst-case complexity of quickselect drops from quadratic to ''linear'', which is also the asymptotically optimal worst-case complexity of any selection algorithm. In other words, the median of medians is an approximate median-selection algorithm that helps build an asymptotically optimal, exact general selection algorithm (especially in the sense of worst-case complexity) by producing good pivot elements.

Median of medians can also be used as a pivot strategy in quicksort, yielding an optimal algorithm with worst-case complexity O(n\log n). Although this approach optimizes the asymptotic worst-case complexity quite well, it is typically outperformed in practice by choosing random pivots instead, which gives average O(n) complexity for selection and average O(n\log n) complexity for sorting, without any overhead of computing the pivot.

Similarly, median of medians is used in the hybrid introselect algorithm as a fallback for pivot selection at each iteration until the ''k''th smallest element is found. This again ensures worst-case linear performance, in addition to average-case linear performance: introselect starts with quickselect (by default with a random pivot) to obtain good average performance, and then falls back to a modified quickselect with a pivot obtained from median of medians if progress is too slow. Even though asymptotically similar, such a hybrid algorithm will have a lower complexity than a straightforward introselect, up to a constant factor (both in the average case and the worst case), at any finite length.

The algorithm was published by Blum, Floyd, Pratt, Rivest, and Tarjan in 1973, and is therefore sometimes called BFPRT after the last names of the authors. In the original paper the algorithm was referred to as PICK, referring to quickselect as "FIND".


Motivation

Quickselect is linear-time on average, but it can require quadratic time with poor pivot choices. This is because quickselect is a divide-and-conquer algorithm, with each step taking O(n) time in the size of the remaining search set. If the search set decreases exponentially quickly in size (by a fixed proportion), this yields a geometric series times the O(n) factor of a single step, and thus linear overall time. However, if the search set decreases slowly in size, such as linearly (by a fixed number of elements, in the worst case only reducing by one element each time), then a linear sum of linear steps yields quadratic overall time (formally, triangular numbers grow quadratically). For example, the worst case occurs when pivoting on the smallest element at each step, such as applying quickselect for the maximum element to already sorted data and taking the first element as pivot each time.

If one instead consistently chooses "good" pivots, this is avoided and one always gets linear performance even in the worst case. A "good" pivot is one for which we can establish that a constant proportion of elements fall both below and above it, as then the search set decreases at least by a constant proportion at each step, hence exponentially quickly, and the overall time remains linear. The median is a good pivot – the best for sorting, and the best overall choice for selection – decreasing the search set by half at each step. Thus if one can compute the median in linear time, this only adds linear time to each step, and the overall complexity of the algorithm remains linear.

The median-of-medians algorithm computes an approximate median, namely a point that is guaranteed to be between the 30th and 70th percentiles (in the middle 4 deciles). Thus the search set decreases by at least 30%: the problem is reduced to at most 70% of the original size, which is a fixed proportion smaller. Applying the same algorithm recursively to the ever smaller search sets until only one or two elements remain results in a cost of \frac{10}{3}n \approx 3.33n.
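The constant comes from summing the geometric series of shrinking search-set sizes (the linear cost of computing each pivot is accounted for separately in the proof below):
:n + 0.7\,n + 0.7^2\,n + \cdots = \frac{n}{1 - 0.7} = \frac{10}{3}n \approx 3.33\,n.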


Algorithm

As stated before, median-of-medians is used as a pivot selection strategy in the quickselect algorithm, which in pseudocode looks as follows. Be careful to handle left, right and n when implementing. The following pseudocode assumes that left, right, and the list use one-based numbering and that select is initially called with 1 as the argument to left and the length of the list as the argument to right. Note that this returns the index of the n'th smallest number after rearranging the list, rather than the actual value of the n'th smallest number.

 function select(list, left, right, n)
     loop
         if left = right then
             return left
         pivotIndex := pivot(list, left, right)
         pivotIndex := partition(list, left, right, pivotIndex, n)
         if n = pivotIndex then
             return n
         else if n < pivotIndex then
             right := pivotIndex - 1
         else
             left := pivotIndex + 1

Subroutine pivot is the actual median-of-medians algorithm. It divides its input (a list of length n) into groups of at most five elements, computes the median of each of those groups using some subroutine, then recursively computes the true median of the \frac{n}{5} medians found in the previous step. Note that pivot calls select; this is an instance of mutual recursion.

 function pivot(list, left, right)
     ''// for 5 or fewer elements just get median''
     if right − left < 5 then
         return partition5(list, left, right)
     ''// otherwise move the medians of five-element subgroups to the first n/5 positions''
     for i from left to right in steps of 5
         ''// get the median position of the i'th five-element subgroup''
         subRight := i + 4
         if subRight > right then
             subRight := right
         median5 := partition5(list, i, subRight)
         swap list[median5] and list[left + floor((i − left) / 5)]
     ''// compute the median of the n/5 medians-of-five''
     mid := (right − left) / 10 + left + 1
     return select(list, left, left + floor((right − left) / 5), mid)


Partition helper functions

There is a subroutine called partition that can, in linear time, group a list (ranging from indices left to right) into three parts: those less than a certain element, those equal to it, and those greater than the element (a three-way partition). The grouping into three parts ensures that median-of-medians maintains linear execution time in the case of many or all coincident elements. Here is pseudocode that performs a partition about the element list[pivotIndex]:

 function partition(list, left, right, pivotIndex, n)
     pivotValue := list[pivotIndex]
     swap list[pivotIndex] and list[right]  ''// Move pivot to end''
     storeIndex := left
     ''// Move all elements smaller than the pivot to the left of the pivot''
     for i from left to right − 1 do
         if list[i] < pivotValue then
             swap list[storeIndex] and list[i]
             increment storeIndex
     ''// Move all elements equal to the pivot right after''
     ''// the smaller elements''
     storeIndexEq := storeIndex
     for i from storeIndex to right − 1 do
         if list[i] = pivotValue then
             swap list[storeIndexEq] and list[i]
             increment storeIndexEq
     swap list[right] and list[storeIndexEq]  ''// Move pivot to its final place''
     ''// Return location of pivot considering the desired location n''
     if n < storeIndex then
         return storeIndex  ''// n is in the group of smaller elements''
     if n ≤ storeIndexEq then
         return n  ''// n is in the group equal to pivot''
     return storeIndexEq  ''// n is in the group of larger elements''

The subroutine partition5 selects the median of a group of at most five elements; an easy way to implement this is insertion sort, as shown below. It can also be implemented as a decision tree.

 function partition5(list, left, right)
     i := left + 1
     while i ≤ right
         j := i
         while j > left and list[j−1] > list[j] do
             swap list[j−1] and list[j]
             j := j − 1
         i := i + 1
     return floor((left + right) / 2)
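For concreteness, here is a Python sketch of the routines above; it is an adaptation rather than part of the original text. It uses zero-based indexing, so the index arithmetic differs slightly from the one-based pseudocode, while the function names select, pivot, partition, and partition5 mirror the pseudocode. select returns the index at which the n-th smallest element (with n counted from zero) ends up after the list has been partially rearranged.

 # Python sketch of median-of-medians quickselect, adapted from the pseudocode above.
 # Zero-based indices are used throughout.

 def select(lst, left, right, n):
     """Return the index of the n-th smallest element of lst[left..right] (0-based)."""
     while True:
         if left == right:
             return left
         pivot_index = pivot(lst, left, right)
         pivot_index = partition(lst, left, right, pivot_index, n)
         if n == pivot_index:
             return n
         elif n < pivot_index:
             right = pivot_index - 1
         else:
             left = pivot_index + 1

 def pivot(lst, left, right):
     """Choose a pivot index using the median-of-medians rule."""
     if right - left < 5:
         return partition5(lst, left, right)
     # Move the median of each five-element group to the front of the range.
     for i in range(left, right + 1, 5):
         sub_right = min(i + 4, right)
         median5 = partition5(lst, i, sub_right)
         dest = left + (i - left) // 5
         lst[median5], lst[dest] = lst[dest], lst[median5]
     # Recursively select the true median of the group medians (mutual recursion).
     mid = (right - left) // 10 + left + 1
     return select(lst, left, left + (right - left) // 5, mid)

 def partition(lst, left, right, pivot_index, n):
     """Three-way partition around lst[pivot_index]; return an index select can use."""
     pivot_value = lst[pivot_index]
     lst[pivot_index], lst[right] = lst[right], lst[pivot_index]   # move pivot to end
     store = left
     for i in range(left, right):          # move elements smaller than the pivot left
         if lst[i] < pivot_value:
             lst[store], lst[i] = lst[i], lst[store]
             store += 1
     store_eq = store
     for i in range(store, right):         # then gather elements equal to the pivot
         if lst[i] == pivot_value:
             lst[store_eq], lst[i] = lst[i], lst[store_eq]
             store_eq += 1
     lst[right], lst[store_eq] = lst[store_eq], lst[right]         # pivot to final place
     if n < store:
         return store        # n lies in the group of smaller elements
     if n <= store_eq:
         return n            # n lies in the group equal to the pivot
     return store_eq         # n lies in the group of larger elements

 def partition5(lst, left, right):
     """Insertion-sort lst[left..right] (at most five elements); return the median's index."""
     for i in range(left + 1, right + 1):
         j = i
         while j > left and lst[j - 1] > lst[j]:
             lst[j - 1], lst[j] = lst[j], lst[j - 1]
             j -= 1
     return (left + right) // 2

 if __name__ == "__main__":
     data = [9, 1, 0, 2, 3, 4, 6, 8, 7, 10, 5]
     idx = select(data, 0, len(data) - 1, 5)   # rank 5 = the sixth smallest element
     print(data[idx])                          # prints 5

In the usage example, select locates the element of rank 5 (the sixth smallest), so data[idx] is 5; the mutual recursion between select and pivot mirrors the pseudocode, with pivot calling select to find the true median of the group medians.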


Properties of pivot

Of the \frac{n}{5} groups, half the number of groups (\frac{1}{2}\times\frac{n}{5}=\frac{n}{10}) have their median less than the pivot (the median of medians). Also, another half the number of groups (again, \frac{1}{2}\times\frac{n}{5}=\frac{n}{10}) have their median greater than the pivot. In each of the \frac{n}{10} groups with median less than the pivot, there are two elements that are smaller than their respective medians, which are smaller than the pivot. Thus, each of these \frac{n}{10} groups has at least 3 elements that are smaller than the pivot. Similarly, in each of the \frac{n}{10} groups with median greater than the pivot, there are two elements that are greater than their respective medians, which are greater than the pivot. Thus, each of these \frac{n}{10} groups has at least 3 elements that are greater than the pivot. Hence, the pivot is less than 3\times\frac{n}{10}=\frac{3n}{10} elements and greater than another 3\times\frac{n}{10}=\frac{3n}{10} elements. Thus the chosen median splits the ordered elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm.

(The original article illustrates this with 100 elements arranged as twenty 5-tuples, sorted by median for clarity, with one of the two possible medians of medians marked in red, smaller numbers in gray, and larger numbers in white: all elements above/left of the red element – 30% of the 100 elements – are smaller, and all elements below/right of it – another 30% – are greater. Sorting the tuples is not necessary, because only the median of each tuple is needed for use as the pivot element.)
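As a quick concrete check of this guarantee (not part of the original argument), one can verify it empirically in Python. The sketch below uses the exact median of the group medians via statistics.median (which, for an even number of groups, averages the two middle medians, whereas the algorithm itself picks one of them) and assumes distinct values with a length divisible by five, so the small additive constants from a partial final group disappear; the helper name median_of_medians is illustrative only.

 # Empirical check of the 30%/70% guarantee for the median-of-medians pivot.
 # Assumes distinct values and a length divisible by 5.
 import random
 from statistics import median

 def median_of_medians(values):
     """Exact median of the medians of consecutive groups of five (illustration only)."""
     group_medians = [median(values[i:i + 5]) for i in range(0, len(values), 5)]
     return median(group_medians)

 random.seed(0)
 for _ in range(1000):
     data = random.sample(range(10_000), 100)     # 100 distinct values
     pivot = median_of_medians(data)
     rank = sum(x < pivot for x in data)          # elements strictly below the pivot
     assert 30 <= rank <= 70, rank                # always within the middle 40% of ranks
 print("pivot rank stayed between the 30th and 70th percentiles")

Under these assumptions the pivot's rank always lands in the middle 40% of positions, matching the 30%/70% split described above.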


Proof of O(''n'') running time

The median-calculating recursive call does not exceed worst-case linear behavior, because the list of medians has size \frac{n}{5}, while the other recursive call recurses on at most 70% of the list. Let T(n) be the time it takes to run a median-of-medians quickselect algorithm on an array of size n. Then we know this time is:
:T(n) \leq T\left(\frac{n}{5}\right) + T\left(\frac{7n}{10}\right) + c \cdot n,
where
* the T\left(\frac{n}{5}\right) part is for finding the ''true'' median of the \frac{n}{5} medians, by running an (independent) quickselect on them (since finding the median is just a special case of selecting a ''k''-smallest element),
* the O(n) term c \cdot n is for the partitioning work to create the two sides, one of which our quickselect recurses into (we visit each element a constant number of times in order to form them into \frac{n}{5} groups and take each median in O(1) time),
* the T\left(\frac{7n}{10}\right) part is for the actual quickselect recursion (for the worst case, in which the ''k''-th element is in the bigger partition, which can be of size \frac{7n}{10} at most).
By induction:
:T(n) \leq 10 \cdot c \cdot n \in O(n).
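Spelling out the inductive step (a routine verification, not written out above): assuming T(m) \leq 10 \cdot c \cdot m for all m < n and substituting into the recurrence gives
:T(n) \leq 10c \cdot \frac{n}{5} + 10c \cdot \frac{7n}{10} + c \cdot n = 2cn + 7cn + cn = 10cn,
so the bound closes; such a finite constant exists precisely because \frac{1}{5} + \frac{7}{10} = \frac{9}{10} < 1.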


Analysis

The key step is reducing the problem to selecting in two lists whose total length is shorter than the original list, plus a linear factor for the reduction step. This allows a simple induction to show that the overall running time is linear.

The specific choice of groups of five elements is explained as follows. Firstly, computing the median of an odd-length list is faster and simpler; while one could use an even-length list, this requires taking the average of the two middle elements, which is slower than simply selecting the single exact middle element. Secondly, five is the smallest odd number for which median of medians works. With groups of only three elements, the resulting list of medians to search in is of length \frac{n}{3}, and the list to recurse into is of length at most \frac{2}{3}n, since the pivot is greater than 1/2 × 2/3 = 1/3 of the elements and less than another 1/2 × 2/3 = 1/3 of the elements. Thus this still leaves \frac{n}{3} + \frac{2}{3}n = n elements to search in across the two recursive calls, not reducing the problem sufficiently. The individual lists are shorter, however, and one can bound the resulting complexity by O(n \log n) using the Akra–Bazzi method, but this does not prove linearity.

Conversely, one may instead group by g = seven, nine, or more elements, and this does work. This reduces the size of the list of medians to \frac{n}{g}, and the size of the list to recurse into asymptotically approaches \frac{3}{4}n (75%), as the quadrants in the above illustration approach 25% and the size of the overlapping lines decreases proportionally. This reduces the scaling factor from 10 asymptotically to 4, but accordingly raises the c term for the partitioning work. Finding the median of a larger group takes longer, but this is a constant factor (depending only on g), and thus does not change the overall performance as ''n'' grows; it only changes the constant factor in the worst-case number of comparisons.

If one instead groups the other way, say dividing the n-element list into 5 lists, computing the median of each, and then computing the median of these – i.e., grouping by a constant fraction, not a constant number – one does not as clearly reduce the problem, since it requires computing 5 medians, each in a list of \frac{n}{5} elements, and then recursing on a list of length at most \frac{7}{10}n. As with grouping by 3, the individual lists are shorter, but the overall length is no shorter – in fact longer – and thus one can only prove superlinear bounds. Grouping into a square of \sqrt{n} lists of length \sqrt{n} is similarly complicated.
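In terms of recurrences, the contrast can be stated explicitly (restating the bounds above for concreteness). Grouping by three gives
:T(n) \leq T\left(\frac{n}{3}\right) + T\left(\frac{2n}{3}\right) + c \cdot n, \qquad \frac{n}{3} + \frac{2n}{3} = n,
which only yields T(n) = O(n \log n), whereas grouping by five gives
:T(n) \leq T\left(\frac{n}{5}\right) + T\left(\frac{7n}{10}\right) + c \cdot n, \qquad \frac{n}{5} + \frac{7n}{10} = \frac{9n}{10} < n,
which yields T(n) = O(n).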


References

* Blum, Manuel; Floyd, Robert W.; Pratt, Vaughan; Rivest, Ronald L.; Tarjan, Robert E. (1973). "Time bounds for selection". ''Journal of Computer and System Sciences''. 7 (4): 448–461.


External links

* Lecture notes for January 30, 1996: Deterministic selection, ''ICS 161: Design and Analysis of Algorithms'', David Eppstein