Reservoir sampling is a family of
randomized algorithms for choosing a
simple random sample, without replacement, of items from a population of unknown size in a single pass over the items. The size of the population is not known to the algorithm and is typically too large for all items to fit into
main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items. At any point, the current state of the algorithm must permit extraction of a simple random sample without replacement of size over the part of the population seen so far.
Motivation
Suppose we see a sequence of items, one at a time. We want to keep 10 items in memory, and we want them to be selected at random from the sequence. If we know the total number of items and can access the items arbitrarily, then the solution is easy: select 10 distinct indices between 1 and with equal probability, and keep the -th elements. The problem is that we do not always know the exact in advance.
Simple: Algorithm R
A simple and popular but slow algorithm, ''Algorithm R'', was created by Jeffrey Vitter.
Initialize an array indexed from to , containing the first items of the input . This is the ''reservoir''.
For each new input , generate a random number uniformly in . If , then set Otherwise, discard .
Return after all inputs are processed.
This algorithm works by induction on
.
While conceptually simple and easy to understand, this algorithm needs to generate a random number for each item of the input, including the items that are discarded. The algorithm's
asymptotic running time is thus
. Generating this amount of randomness and the linear run time causes the algorithm to be unnecessarily slow if the input population is large.
This is ''Algorithm R,'' implemented as follows:
(* S has items to sample, R will contain the result *)
ReservoirSample(S ..n R ..k
// fill the reservoir array
for i := 1 to k
R := S end
// replace elements with gradually decreasing probability
for i := k+1 to n
(* randomInteger(a, b) generates a uniform integer from the inclusive range *)
j := randomInteger(1, i)
if j <= k
R := S end
end
end
Optimal: Algorithm L
If we generate
random numbers