## Taking Random Samples

Posted by Eyal Schneider on April 1, 2010

Sometimes we need to take a random sample from a given collection of objects. This is useful in simulations, games, and statistical applications. In this post I discuss some efficient algorithms for choosing m items out of n available, such that the m items are chosen at random, with equal probability for all possible subsets of size m. For each algorithm I present the code in Java, as well as a formal analysis of its performance and correctness, where needed. As you may have already noticed from the introduction, this post is more theoretic than usual…

A somehow related problem is the problem of generating a random permutation of given collection. While Java provides an efficient utility method for that purpose (See Collections.shuffle(..), based on the Knuth Shuffle algorithm), there is no built-in utility for generating random subsets.

Before we start discussing some known algorithms, it is important to note that there are many variations of the problem, depending on the following parameters:

- Is the collection random access (e.g. ArrayList)?
- Is the collection read only?
- Do we allow random running time?
- Is N known at invocation time, or are we dealing with a stream of items of unknown size?

## Trial and Error

We will start with the simplest approach. Assuming that the collection to take the sample from is random access, we can repeatedly add random items from the collection to our result set, until the set contains m different items. As an optimization, when m > n/2, we can choose (n-m) items instead of m, and then return the rest:

public static <T> Set<T> randomSample1(List<T> items, int m){ HashSet<T> res = new HashSet<T>(); int n = items.size(); if (m > n/2){ // The optimization Set<T> negativeSet = randomSample1(items, n-m); for(T item : items){ if (!negativeSet.contains(item)) res.add(item); } }else{ // The main loop while(res.size() < m){ int randPos = rnd.nextInt(n); res.add(items.get(randPos)); } } return res; }

Clearly, the number of iterations in the main loop is not bounded. However, we can compute the *expected* number of iterations, what makes this algorithm a Las Vegas algorithm. The iterations can be split into m different sequences (numbered 0 to m-1), where sequence i refers to the attempts needed for adding the (i+1)-th item into the result set. By observing that the length of sequence i has a geometric distribution defined by p=(n-i)/n, we can calculate the expected number of iterations as follows:

And lets also define

If we examine E(m,n) as a function of m only, in the interval to , it can be easily verified that the function is increasing and convex. Therefore we have:

In other words the expected time complexity is linear in m, which is optimal. This is a little surprising, given the naive nature of the algorithm, and it results from our optimization.

## Swapping

If our collection is random access and its items can be freely reordered, then we can efficiently draw random items one by one from a candidates set, containing only items not chosen so far. By swapping items, we can guarantee that the candidates set is always contiguous . As a matter of fact, this algorithm is a bounded version of the Knuth Shuffle algorithm for generating random permutations, and its correctness is trivial.

public static <T> List<T> randomSample2(List<T> items, int m){ for(int i=0;i<m;i++){ int pos = i + rnd.nextInt(items.size() - i); T tmp = items.get(pos); items.set(pos, items.get(i)); items.set(i, tmp); } return items.subList(0, m); }

## Full Scan

Sometimes our collection is not random access, so we have no choice but to traverse it completely in the worst case. Following is an elegant solution, that iterates once through the items, and selects an item with probability (#remaining to select)/(#remaining to scan):

public static <T> List<T> randomSample3(List<T> items, int m){ ArrayList<T> res = new ArrayList<T>(m); int visited = 0; Iterator<T> it = items.iterator(); while (m > 0){ T item = it.next(); if (rnd.nextDouble() < ((double)m)/(items.size() - visited)){ res.add(item); m--; } visited++; } return res; }

Clearly, the running time is O(n), which is optimal given the constraints.

It is left to prove that for any collection C and any subset S, the chances of generating S are . We can do it by induction on the size of C.

For |C|=1 and |S| between 0 and 1 this is trivial. Now, let C be an ordered collection and let S be a subset of C. When the algorithm starts traversing C, it first encounters item v. If , we should choose it and then proceed by choosing items S-{v} from the collection C-{v}. We can use our induction assumption to calculate the probability of this:

In either cases we get the required probability.

## Floyd’s Algorithm

What happens if we don’t want the time complexity to be dependent on N, and the collection is read only?

In this case, assuming the collection is random access, we can use Floyd’s algorithm, which is both brilliant and easy to implement. It iterates with variable i through the last m indexes of the collection, and on each iteration adds a single item from the range 0..i, with a non-uniform distribution:

public static <T> Set<T> randomSample4(List<T> items, int m){ HashSet<T> res = new HashSet<T>(m); int n = items.size(); for(int i=n-m;i<n;i++){ int pos = rnd.nextInt(i+1); T item = items.get(pos); if (res.contains(item)) res.add(items.get(i)); else res.add(item); } return res; }

The time complexity is O(m) on the average, but we can bound the worst case time by O(m log(m)) if we use a balanced tree instead of a hash set.

The correctness follows by proving the following loop invariant: After completing an iteration with i=j, the set *res *contains a uniformly distributed random subset of size j-n+m+1 of the items in the range 0..j. We will prove this by induction on i. For the initial value of i (i=n-m), we simply choose a random position in the range 0..i, so it is also a random subset of size 1 of the same range. Now, let i=j (>n-m), and let S be any subset of size j-n+m+1 from the range 0..j. If items[j] is in S, then the previous iterations must have completed with res=S-{items[j]}, and then either position j or any previously selected position should be chosen. Using the induction assumption, we can compute the probability of obtaining S as follows:

If on the other hand items[j] is not in S, then we have many options of selecting |S|-1 items in the previous iterations, and then we should choose the remaining index:

In both cases we have a uniform probability for choosing |S| items from j+1 candidates, and that completes the proof.

## Stream of items

Sometimes we don’t know the collection size in advance. We can only iterate through it, as if it was a data stream with unknown size. The following algorithm (Known as Reservoir sampling) performs only one pass on the input stream. While iterating, it maintains a set of items that represents a random subset (of size m) of the items visited so far. It starts by selecting the first m items, and then it selects the k-th item with probability m/k. If the item is selected, it replaces a randomly chosen item from the previously selected ones:

public static <T> List<T> reservoirSample(Iterable<T> items, int m){ ArrayList<T> res = new ArrayList<T>(m); int count = 0; for(T item : items){ count++; if (count <= m) res.add(item); else{ int r = rnd.nextInt(count); if (r < m) res.set(r, item); } } return res; }

Each iteration consumes constant time, therefore the algorithm runs in O(n) time, where n is the stream length.

The correctness can be proved by induction on the stream size. We want to prove that for every stream T and every value of m (m<=|T|), all subsets of T of size m are equally likely to be returned. The base case is a stream of size m. In this case we have only one possible subset, and the algorithm always returns it. Now assume that we have a given stream T, and we know that the induction property holds for stream lengths below |T|. Let S be a subset of T. We shall inspect the last item v in the stream.

If v is not in S, then we must have chosen all items of S by the time we completed the previous iteration. We should also reject the last item:

If on the other hand v is in S, then we should have the other |S|-1 items of the sample in the previous iteration already (plus an extra item among the |T|-|S| possible ones), and then we should choose v and make it replace the extra item:

In both cases we get the required probability.

## A note about random number generators

The algorithms above assume that the random number generator has a pure random behavior, but this is rarely the case. The class java.util.Random uses a very popular pseudo number generation approach, called Linear Congruential Generator. Due to the fact that every generated number is determined by the previously generated one (or initially the seed), there is a bounded number of sequences that can be produced. Java’s Random class uses a state record of 48 bits, so we can never exceed different sequences. This has a huge impact when generating random constructs from a large combinatorial space, such as in the case of subsets or permutations. For example, if we are interested in generating a random subset of size 20 from a set of size 60, we have options, which exceeds . Therefore, the results of any algorithm that uses java.lang.Random would be completely biased because there are many possible subsets that will never be returned. Actually, we will cover only 7% of the subsets space! For many applications this behavior is good enough. For others, which require more accuracy, we should consider a different random source than java.util.Random. After searching the web a little, I found the RngPack library, which seems to have some powerful random number generator implementations.

### Sources

http://delivery.acm.org/10.1145/320000/315746/p754-bentley.pdf?key1=315746&key2=0487028621&coll=GUIDE&dl=GUIDE&CFID=79451466&CFTOKEN=34515112 http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle http://en.wikipedia.org/wiki/Linear_congruential_generator http://comjnl.oxfordjournals.org/cgi/reprint/25/1/45.pdf http://comjnl.oxfordjournals.org/cgi/reprint/28/4/412.pdf http://www.jstor.org/stable/pdfplus/2347297.pdf http://stackoverflow.com/questions/48087/select-a-random-n-elements-from-listt-in-c# http://stackoverflow.com/questions/136474/best-way-to-pick-a-random-subset-from-a-collection

## Seva said

Very useful summary, thanks.

## dracodoc said

Just wondering about the biased problem mentioned in last paragraph. How about this method: suppose you can use n bits to present the combination space of {60 \choose 20}, let’s say 2**200, then you can generate a 200bit number by random chose each bit. So you made 200 random choices, and as long as each choice have even distribution between 0 and 1, the final result should have even distribution in whole combination space.

## Peter said

Hi thanks for the post, it actually came up as an interview question for me recently. Looking at your code examples, you don’t seem to declare

rnd, the first reference to it is:rnd.nextInt(n)## Eyal Schneider said

Hi, I hope you did well in the interview :)

Indeed, the reference rnd is not declared in my code snippets. I assume that it is already defined as a data member (e.g. private static Random rnd = new Random();)

## Dan said

In the first algorithm, if the subset is dense enough we would be doing a for loop on the original array, so in this case it’s always linear in n. Is there something I’m missing ? By the way, thanks for writing this article ;)

## Eyal Schneider said

Your argument only shows that in the worst case, the complexity is at least linear.

However, since this is a random algorithm, there isn’t much sense in examining the worst case (Any big number of iterations you can think of has a non zero probability of occurring). Therefore, we are interested in the average running time (or “Expected time”). The proof I present shows that the average running time is linear in m (much better than linear in n).

## roypaterson said

The “Swapping” algorithm’s correctness may be “trivial” but I think your implementation is wrong! You only need to iterate m times, not items.size().

## Eyal Schneider said

You are right – there’s no need to iterate over the complete array. Fixed it, thanks!

## sboby said

Thanks! Very good work!

## Dror A. said

One question about the full scan algorithm: I fail to understand why the subset `S` is necessarily filled with the `m` elements? In theory it might happen that you never enter the `if` statement and thus `m` is never decreased. However, in this scenario you will continue to iterate over the collection and you will reach its end before filling `S`. Did I miss something? Could you please clarify how you asserts that `S` will have `m` elements at the end?

BTW: +1 for a great and concise post!

## Eyal Schneider said

Note that the probability to accept an item (#remaining to choose / #remaining) varies as you progress with the iteration. If m stops decreasing at some point, the iteration eventually reaches a position where the remaining items count (items.size() – visited) equals m. This makes the ratio 1.0, therefore all subsequent inclusion tests will be successful, and m will decrease accordingly until it reaches 0.

## Dror A. said

Thanks for the quick reply! I noticed that point, and I assumed this is the key. However, it seems like it introduces another problem in terms of the even distribution? If `items.size()-visited` equals `m` then the missing elements will be picked sequentially until the quota is filled. Isn’t it harming the uniform distribution?

## Eyal Schneider said

The distribution is still even. The short proof in the post shows that any subset of size m has the same chance to be chosen.

Even the extreme case where the last m items are chosen has the same probability; it requires the first n-m items not to be chosen, so the probability is:

((n-m)/n) * ((n-1-m)/(n-1)) * ((n-2-m)/(n-2)) * … * (1/(m + 1)) = (n-m)! / (n! / m!) = 1 / C(n,m)

as required.

## huangjianyu said

Reblogged this on Always feel lonely.

## Anne van Rossum said

In Floyd’s algorithm if m=n, you have a problem.

## Eyal Schneider said

Why? it will always return the complete list, as expected.