Reservoir sampling

Last updated

Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large for all n items to fit into main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items. At any point, the current state of the algorithm must permit extraction of a simple random sample without replacement of size k over the part of the population seen so far.

Contents

Motivation

Suppose we see a sequence of items, one at a time. We want to keep 10 items in memory, and we want them to be selected at random from the sequence. If we know the total number of items n and can access the items arbitrarily, then the solution is easy: select 10 distinct indices i between 1 and n with equal probability, and keep the i-th elements. The problem is that we do not always know the exact n in advance.

Simple: Algorithm R

A simple and popular but slow algorithm, Algorithm R, was created by Jeffrey Vitter. [1]

Initialize an array indexed from to , containing the first k items of the input . This is the reservoir.

For each new input , generate a random number j uniformly in . If , then set Otherwise, discard .

Return after all inputs are processed.

This algorithm works by induction on .

Proof

When , Algorithm R returns all inputs, thus providing the basis for a proof by mathematical induction.

Here, the induction hypothesis is that the probability that a particular input is included in the reservoir just before the -th input is processed is and we must show that the probability that a particular input is included in the reservoir is just after the -th input is processed.

Apply Algorithm R to the -th input. Input is included with probability by definition of the algorithm. For any other input , by the induction hypothesis, the probability that it is included in the reservoir just before the -th input is processed is . The probability that it is still included in the reservoir after is processed (in other words, that is not replaced by ) is . The latter follows from the assumption that the integer is generated uniformly at random; once it becomes clear that a replacement will in fact occur, the probability that in particular is replaced by is .

We have shown that the probability that a new input enters the reservoir is equal to the probability that an existing input in the reservoir is retained. Therefore, we conclude by the principle of mathematical induction that Algorithm R does indeed produce a uniform random sample of the inputs.

While conceptually simple and easy to understand, this algorithm needs to generate a random number for each item of the input, including the items that are discarded. The algorithm's asymptotic running time is thus . Generating this amount of randomness and the linear run time causes the algorithm to be unnecessarily slow if the input population is large.

This is Algorithm R, implemented as follows:

(* S has items to sample, R will contain the result *)ReservoirSample(S[1..n],R[1..k])// fill the reservoir arrayfori:=1tokR[i]:=S[i]end// replace elements with gradually decreasing probabilityfori:=k+1ton(* randomInteger(a, b) generates a uniform integer from the inclusive range {a, ..., b} *)j:=randomInteger(1,i)ifj<=kR[j]:=S[i]endendend

Optimal: Algorithm L

If we generate random numbers independently, then the indices of the smallest of them is a uniform sample of the k-subsets of .

The process can be done without knowing :

Keep the smallest of that has been seen so far, as well as , the index of the largest among them. For each new , compare it with . If , then discard , store , and set to be the index of the largest among them. Otherwise, discard , and set .

Now couple this with the stream of inputs . Every time some is accepted, store the corresponding . Every time some is discarded, discard the corresponding .

This algorithm still needs random numbers, thus taking time. But it can be simplified.

First simplification: it is unnecessary to test new one by one, since the probability that the next acceptance happens at is , that is, the interval of acceptance follows a geometric distribution.

Second simplification: it's unnecessary to remember the entire array of the smallest of that has been seen so far, but merely , the largest among them. This is based on three observations:

This is Algorithm L, [2] which is implemented as follows:

(* S has items to sample, R will contain the result *)ReservoirSample(S[1..n],R[1..k])// fill the reservoir arrayfori=1tokR[i]:=S[i]end(* random() generates a uniform (0,1) random number *)W:=exp(log(random())/k)whilei<=ni:=i+floor(log(random())/log(1-W))+1ifi<=n(* replace a random item of the reservoir with item i *)R[randomInteger(1,k)]:=S[i]// random index between 1 and k, inclusiveW:=W*exp(log(random())/k)endendend

This algorithm computes three random numbers for each item that becomes part of the reservoir, and does not spend any time on items that do not. Its expected running time is thus , [2] which is optimal. [1] At the same time, it is simple to implement efficiently and does not depend on random deviates from exotic or hard-to-compute distributions.

With random sort

If we associate with each item of the input a uniformly generated random number, the k items with the largest (or, equivalently, smallest) associated values form a simple random sample. [3] A simple reservoir-sampling thus maintains the k items with the currently largest associated values in a priority queue.

(*  S is a stream of items to sample  S.Current returns current item in stream  S.Next advances stream to next position  min-priority-queue supports:    Count -> number of items in priority queue    Minimum -> returns minimum key value of all items    Extract-Min() -> Remove the item with minimum key    Insert(key, Item) -> Adds item with specified key *)ReservoirSample(S[1..?])H:=newmin-priority-queuewhileShasdatar:=random()// uniformly random between 0 and 1, exclusiveifH.Count<kH.Insert(r,S.Current)else// keep k items with largest associated keysifr>H.MinimumH.Extract-Min()H.Insert(r,S.Current)endS.NextendreturnitemsinHend

The expected running time of this algorithm is and it is relevant mainly because it can easily be extended to items with weights.

Weighted random sampling

The methods presented in the previous sections do not allow to obtain a priori fixed inclusion probabilities. Some applications require items' sampling probabilities to be according to weights associated with each item. For example, it might be required to sample queries in a search engine with weight as number of times they were performed so that the sample can be analyzed for overall impact on user experience. Let the weight of item i be , and the sum of all weights be W. There are two ways to interpret weights assigned to each item in the set: [4]

  1. In each round, the probability of every unselected item to be selected in that round is proportional to its weight relative to the weights of all unselected items. If X is the current sample, then the probability of an item to be selected in the current round is .
  2. The probability of each item to be included in the random sample is proportional to its relative weight, i.e., . Note that this interpretation might not be achievable in some cases, e.g., .

Algorithm A-Res

The following algorithm was given by Efraimidis and Spirakis that uses interpretation 1: [5]

(*  S is a stream of items to sample  S.Current returns current item in stream  S.Weight  returns weight of current item in stream  S.Next advances stream to next position  The power operator is represented by ^  min-priority-queue supports:    Count -> number of items in priority queue    Minimum() -> returns minimum key value of all items    Extract-Min() -> Remove the item with minimum key    Insert(key, Item) -> Adds item with specified key *)ReservoirSample(S[1..?])H:=newmin-priority-queuewhileShasdatar:=random()^(1/S.Weight)// random() produces a uniformly random number in (0,1)ifH.Count<kH.Insert(r,S.Current)else// keep k items with largest associated keysifr>H.MinimumH.Extract-Min()H.Insert(r,S.Current)endendS.NextendreturnitemsinHend

This algorithm is identical to the algorithm given in Reservoir Sampling with Random Sort except for the generation of the items' keys. The algorithm is equivalent to assigning each item a key where r is the random number and then selecting the k items with the largest keys. Equivalently, a more numerically stable formulation of this algorithm computes the keys as and select the k items with the smallest keys. [6] [ failed verification ]

Algorithm A-ExpJ

The following algorithm is a more efficient version of A-Res, also given by Efraimidis and Spirakis: [5]

(*  S is a stream of items to sample  S.Current returns current item in stream  S.Weight  returns weight of current item in stream  S.Next advances stream to next position  The power operator is represented by ^  min-priority-queue supports:    Count -> number of items in the priority queue    Minimum -> minimum key of any item in the priority queue    Extract-Min() -> Remove the item with minimum key    Insert(Key, Item) -> Adds item with specified key *)ReservoirSampleWithJumps(S[1..?])H:=newmin-priority-queuewhileShasdataandH.Count<kr:=random()^(1/S.Weight)// random() produces a uniformly random number in (0,1)H.Insert(r,S.Current)S.NextendX:=log(random())/log(H.Minimum)// this is the amount of weight that needs to be jumped overwhileShasdataX:=X-S.WeightifX<=0t:=H.Minimum^S.Weightr:=random(t,1)^(1/S.Weight)// random(x, y) produces a uniformly random number in (x, y)H.Extract-Min()H.Insert(r,S.Current)X:=log(random())/log(H.Minimum)endS.NextendreturnitemsinHend

This algorithm follows the same mathematical properties that are used in A-Res, but instead of calculating the key for each item and checking whether that item should be inserted or not, it calculates an exponential jump to the next item which will be inserted. This avoids having to create random variates for each item, which may be expensive. The number of random variates required is reduced from to in expectation, where is the reservoir size, and is the number of items in the stream. [5]

Algorithm A-Chao

Warning: the following description is wrong, see Chao's original paper and the discussion here.

Following algorithm was given by M. T. Chao uses interpretation 2: [7] and Tillé (2006). [8]

(*  S has items to sample, R will contain the result  S[i].Weight contains weight for each item *)WeightedReservoir-Chao(S[1..n],R[1..k])WSum:=0// fill the reservoir arrayfori:=1tokR[i]:=S[i]WSum:=WSum+S[i].Weightendfori:=k+1tonWSum:=WSum+S[i].Weightp:=S[i].Weight/WSum// probability for this itemj:=random();// uniformly random between 0 and 1ifj<=p// select item according to probabilityR[randomInteger(1,k)]:=S[i]//uniform selection in reservoir for replacementendendend

For each item, its relative weight is calculated and used to randomly decide if the item will be added into the reservoir. If the item is selected, then one of the existing items of the reservoir is uniformly selected and replaced with the new item. The trick here is that, if the probabilities of all items in the reservoir are already proportional to their weights, then by selecting uniformly which item to replace, the probabilities of all items remain proportional to their weight after the replacement.

Note that Chao doesn't specify how to sample the first k elements. He simple assumes we have some other way of picking them in proportion to their weight. Chao: "Assume that we have a sampling plan of fixed size with respect to S_k at time A; such that its first-order inclusion probability of X_t is π(k; i)".

Algorithm A-Chao with Jumps

Similar to the other algorithms, it is possible to compute a random weight j and subtract items' probability mass values, skipping them while j > 0, reducing the number of random numbers that have to be generated. [4]

(*  S has items to sample, R will contain the result  S[i].Weight contains weight for each item *)WeightedReservoir-Chao(S[1..n],R[1..k])WSum:=0// fill the reservoir arrayfori:=1tokR[i]:=S[i]WSum:=WSum+S[i].Weightendj:=random()// uniformly random between 0 and 1pNone:=1// probability that no item has been selected so far (in this jump)fori:=k+1tonWSum:=WSum+S[i].Weightp:=S[i].Weight/WSum// probability for this itemj-=p*pNonepNone:=pNone*(1-p)ifj<=0R[randomInteger(1,k)]:=S[i]//uniform selection in reservoir for replacementj=random()pNone:=1endendend

Applications for Multi-Class Fairness

In machine learning applications, fairness is a critical consideration, especially in scenarios where data streams exhibit class imbalance. To address this, Nikoloutsopoulos, Titsias, and Koutsopoulos proposed the Kullback-Leibler Reservoir Sampling (KLRS) algorithm as a solution to the challenges of Continual Learning, where models must learn incrementally from a continuous data stream.

KLRS Algorithm

The KLRS algorithm was designed to create a flexible policy that matches class percentages in the buffer to a target distribution while employing Reservoir Sampling techniques. This is achieved by minimizing the Kullback-Leibler (KL) divergence between the current buffer distribution and the desired target distribution.

KLRS generalizes earlier methods like Reservoir Sampling and Class-Balancing Reservoir Sampling, as verified by experiments using confidence intervals, demonstrating its broader applicability and improved performance.

Algorithm Description

The KLRS algorithm operates by maintaining a buffer of size and updating its contents as new data points arrive in a stream. Below is the pseudocode for the KLRS algorithm:

Algorithm: KLRS

KLRS(Stream,BufferSizeM,TargetDistribution)Input:*Stream(datapoints(x,y)arrivingsequentially)*BufferSizeM*TargetDistribution(desiredclassdistributioninthebuffer)Output:UpdatedbuffermaintainingthetargetdistributionInitializebufferMwiththefirstMdatapointsfromthestream.Fort=M+1to:-Updatestreamcounterstocomputen_{t+1}.-Computetargetdistributionp_{t+1}.-Foreachclasskinthebuffer:-Simulateremovingoneinstanceofclasskandadding(x_t,y_t).-ComputeKLdivergencev_kbetweenp_{t+1}andthemodifiedbufferdistribution.-Selecttheclassk*withtheminimumv_k.-Ifk*=y_t:-Generaterandomnumberrin[1,n_{t+1},y_t].-Ifrm_{t, y_t},-replacearandominstanceofclassy_tinthebufferwith(x_t,y_t).-Else,replacearandominstanceoftheselectedclassk*inthebufferwith(x_t,y_t).-Updatebuffercountersm_{t+1].    - Update class set C_{t+1}:=unique(C_t{y_t}).

Relation to Fisher–Yates shuffle

Suppose one wanted to draw k random cards from a deck of cards. A natural approach would be to shuffle the deck and then take the top k cards. In the general case, the shuffle also needs to work even if the number of cards in the deck is not known in advance, a condition which is satisfied by the inside-out version of the Fisher–Yates shuffle: [9]

(* S has the input, R will contain the output permutation *)Shuffle(S[1..n],R[1..n])R[1]:=S[1]forifrom2tondoj:=randomInteger(1,i)// inclusive rangeR[i]:=R[j]R[j]:=S[i]endend

Note that although the rest of the cards are shuffled, only the first k are important in the present context. Therefore, the array R need only track the cards in the first k positions while performing the shuffle, reducing the amount of memory needed. Truncating R to length k, the algorithm is modified accordingly:

(* S has items to sample, R will contain the result *)ReservoirSample(S[1..n],R[1..k])R[1]:=S[1]forifrom2tokdoj:=randomInteger(1,i)// inclusive rangeR[i]:=R[j]R[j]:=S[i]endforifromk+1tondoj:=randomInteger(1,i)// inclusive rangeif(j<=k)R[j]:=S[i]endendend

Since the order of the first k cards is immaterial, the first loop can be removed and R can be initialized to be the first k items of the input. This yields Algorithm R.

Limitations

Reservoir sampling makes the assumption that the desired sample fits into main memory, often implying that k is a constant independent of n. In applications where we would like to select a large subset of the input list (say a third, i.e. ), other methods need to be adopted. Distributed implementations for this problem have been proposed. [10]

Related Research Articles

<span class="mw-page-title-main">Huffman coding</span> Technique to compress data

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

In quantum computing, Grover's algorithm, also known as the quantum search algorithm, is a quantum algorithm for unstructured search that finds with high probability the unique input to a black box function that produces a particular output value, using just evaluations of the function, where is the size of the function's domain. It was devised by Lov Grover in 1996.

<span class="mw-page-title-main">Naive Bayes classifier</span> Probabilistic classification algorithm

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" which assumes that the features are conditionally independent, given the target class. The strength (naivety) of this assumption is what gives the classifier its name. These classifiers are among the simplest Bayesian network models.

<span class="mw-page-title-main">Order statistic</span> Kth smallest value in a statistical sample

In statistics, the kth order statistic of a statistical sample is equal to its kth-smallest value. Together with rank statistics, order statistics are among the most fundamental tools in non-parametric statistics and inference.

A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random determined by the random bits; thus either the running time, or the output are random variables.

The information bottleneck method is a technique in information theory introduced by Naftali Tishby, Fernando C. Pereira, and William Bialek. It is designed for finding the best tradeoff between accuracy and complexity (compression) when summarizing a random variable X, given a joint probability distribution p(X,Y) between X and an observed relevant variable Y - and self-described as providing "a surprisingly rich framework for discussing a variety of problems in signal processing and learning".

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

The cross-entropy (CE) method is a Monte Carlo method for importance sampling and optimization. It is applicable to both combinatorial and continuous problems, with either a static or noisy objective.

A stochastic simulation is a simulation of a system that has variables that can change stochastically (randomly) with individual probabilities.

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient, is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938, though Gustav Fechner had proposed a similar measure in the context of time series in 1897.

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

In computational geometry, the Bentley–Ottmann algorithm is a sweep line algorithm for listing all crossings in a set of line segments, i.e. it finds the intersection points of line segments. It extends the Shamos–Hoey algorithm, a similar previous algorithm for testing whether or not a set of line segments has any crossings. For an input consisting of line segments with crossings, the Bentley–Ottmann algorithm takes time . In cases where , this is an improvement on a naïve algorithm that tests every pair of segments, which takes .

A randomness extractor, often simply called an "extractor", is a function, which being applied to output from a weak entropy source, together with a short, uniformly random seed, generates a highly random output that appears independent from the source and uniformly distributed. Examples of weakly random sources include radioactive decay or thermal noise; the only restriction on possible sources is that there is no way they can be fully controlled, calculated or predicted, and that a lower bound on their entropy rate can be established. For a given source, a randomness extractor can even be considered to be a true random number generator (TRNG); but there is no single extractor that has been proven to produce truly random output from any type of weakly random source.

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing p−1 < s elements from the result. These elements then divide the array into p approximately equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.

In cryptography, learning with errors (LWE) is a mathematical problem that is widely used to create secure encryption algorithms. It is based on the idea of representing secret information as a set of equations with errors. In other words, LWE is a way to hide the value of a secret by introducing noise to it. In more technical terms, it refers to the computational problem of inferring a linear -ary function over a finite ring from given samples some of which may be erroneous. The LWE problem is conjectured to be hard to solve, and thus to be useful in cryptography.

<span class="mw-page-title-main">Group testing</span> Procedure that breaks up the task of identifying certain objects into tests on groups of items

In statistics and combinatorial mathematics, group testing is any procedure that breaks up the task of identifying certain objects into tests on groups of items, rather than on individual ones. First studied by Robert Dorfman in 1943, group testing is a relatively new field of applied mathematics that can be applied to a wide range of practical applications and is an active area of research today.

In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was published by Andrei Broder in a 1997 conference, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

<span class="mw-page-title-main">Matrix completion</span>

Matrix completion is the task of filling in the missing entries of a partially observed matrix, which is equivalent to performing data imputation in statistics. A wide range of datasets are naturally organized in matrix form. One example is the movie-ratings matrix, as appears in the Netflix problem: Given a ratings matrix in which each entry represents the rating of movie by customer , if customer has watched movie and is otherwise missing, we would like to predict the remaining entries in order to make good recommendations to customers on what to watch next. Another example is the document-term matrix: The frequencies of words used in a collection of documents can be represented as a matrix, where each entry corresponds to the number of times the associated term appears in the indicated document.

In computer science, the count-distinct problem (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.

References

  1. 1 2 Vitter, Jeffrey S. (1 March 1985). "Random sampling with a reservoir" (PDF). ACM Transactions on Mathematical Software. 11 (1): 37–57. CiteSeerX   10.1.1.138.784 . doi:10.1145/3147.3165. S2CID   17881708.
  2. 1 2 Li, Kim-Hung (4 December 1994). "Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))". ACM Transactions on Mathematical Software. 20 (4): 481–493. doi: 10.1145/198429.198435 . S2CID   15721242.
  3. Fan, C.; Muller, M.E.; Rezucha, I. (1962). "Development of sampling plans by using sequential (item by item) selection techniques and digital computers". Journal of the American Statistical Association. 57 (298): 387–402. doi:10.1080/01621459.1962.10480667. JSTOR   2281647.
  4. 1 2 Efraimidis, Pavlos S. (2015). "Weighted Random Sampling over Data Streams". Algorithms, Probability, Networks, and Games. Lecture Notes in Computer Science. Vol. 9295. pp. 183–195. arXiv: 1012.0256 . doi:10.1007/978-3-319-24024-4_12. ISBN   978-3-319-24023-7. S2CID   2008731.
  5. 1 2 3 Efraimidis, Pavlos S.; Spirakis, Paul G. (2006-03-16). "Weighted random sampling with a reservoir". Information Processing Letters. 97 (5): 181–185. doi:10.1016/j.ipl.2005.11.003.
  6. Arratia, Richard (2002). Bela Bollobas (ed.). "On the amount of dependence in the prime factorization of a uniform random integer". Contemporary Combinatorics. 10: 29–91. CiteSeerX   10.1.1.745.3975 . ISBN   978-3-642-07660-2.
  7. Chao, M. T. (1982). "A general purpose unequal probability sampling plan". Biometrika. 69 (3): 653–656. doi:10.1093/biomet/69.3.653.
  8. Tillé, Yves (2006). Sampling Algorithms. Springer. ISBN   978-0-387-30814-2.
  9. National Research Council (2013). Frontiers in Massive Data Analysis. The National Academies Press. p. 121. ISBN   978-0-309-28781-4.
  10. Reservoir Sampling in MapReduce