It’s been more decades than I care to admit since I’ve had statistics, and I’d like to ask for some assistance with some counting and odds problems. (For my own edification. These aren’t homework problems.)
I remember the formula for “N choose M, without replacement”: N! / (M! (N - M)!)
But now, let’s assume that you choose another set without replacement, say P, from the original N. I have some questions about the relationships between M and P.
It always helps me to give actual numbers, so let’s say …
N is 20 (the original set contains 20 items),
M is 2 (2 elements are chosen, without replacement), and
P is 5 (5 new elements are chosen, without replacement, from the original 20).
What are the formulas for these:
1. How many elements from M would you “expect” to see in P?
2a. What are the odds of finding exactly 0 elements from M in P?
2b. What are the odds of finding exactly 1 element from M in P?
2c. What are the odds of finding exactly 2 elements from M in P?
…
2x. What are the odds of finding all elements from M in P?
3a. How big would P have to be for you to expect that P contains exactly 1 element from M?
3b. How big would P have to be for you to expect that P contains exactly 2 elements from M?
…
3x. How big would P have to be for you to expect that P contains all elements from M?
Thanks,
J.
Yes, M elements are chosen first (without replacement). Then those elements are replaced back into N. Then P elements are chosen (without replacement) from the original 20 elements in N.
I will write “n choose k” as C(n, k). In the simplest version of the hypergeometric distribution, you have a population of N items that can be split into two non-overlapping groups of size M and N - M. You want to sample without replacement from the overall population, and you want to compute the probability that a certain number of the sampled items come from the first group (the one of size M). If you draw a sample of P items, the probability that exactly k of them come from the first group is C(M, k)C(N - M, P - k)/C(N, P).
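To make that concrete for the numbers in the question (N = 20, M = 2, P = 5), here is a minimal sketch that evaluates the formula with exact fractions. The function name hypergeom_pmf is just a label I picked for illustration, not something from a library.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(N, M, P, k):
    """Probability that exactly k of the M marked items show up
    in a sample of P items drawn without replacement from N."""
    return Fraction(comb(M, k) * comb(N - M, P - k), comb(N, P))

N, M, P = 20, 2, 5
# k can be at most min(M, P): you cannot find more marked items
# than exist, or more than you drew.
for k in range(min(M, P) + 1):
    p = hypergeom_pmf(N, M, P, k)
    print(k, p, float(p))
```

The three probabilities (for k = 0, 1, 2) should add up to 1, which is a handy check that the formula was typed in correctly.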
The number of elements from the first group you expect to see in your sample is the mean, which is equal to PM/N. For your example, that’s 5*2/20, or 1/2. You can rearrange that formula to find how large a sample you need in order to expect to see a certain number of elements from the first group. It’s instructive to compute how large a sample you have to take to expect to see the entire first group, so I’ll let you work that one out.
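Here is a small sketch of the mean and of the rearranged formula, again in Python. Both function names are made up for this example, and the second one simply solves PM/N = target for P, so the result is not guaranteed to be a whole number.

```python
from fractions import Fraction

def expected_overlap(N, M, P):
    """Mean number of the M marked items in a sample of size P."""
    return Fraction(P * M, N)

def sample_size_for(N, M, target):
    """Sample size P at which the expected overlap equals target.
    Solves P*M/N = target for P; may come out non-integer."""
    return Fraction(target * N, M)

print(expected_overlap(20, 2, 5))  # 1/2
print(sample_size_for(20, 2, 1))   # 10
```

Keep in mind that “expect” here means an average over many repetitions. A sample of 10 does not guarantee exactly one hit.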
It is indeed the hypergeometric distribution. In the first draw you are essentially selecting m “defective” units out of the N available. Then in your second draw of n units you are looking for these same m units. So N and m are obviously 20 and 2, respectively, in your example. So what should n and k be in the wiki article? I get 1/19 as the answer.
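If I read the 1/19 figure right, it is the probability that both of the originally chosen elements turn up in the second draw (n = 5, k = 2 in that notation). That reading is my assumption. One way to sanity-check it is to simulate the two-step process described in the question: pick M items, put them back, then pick P items. This is only a rough Monte Carlo sketch, and the function name, trial count and seed are arbitrary choices of mine.

```python
import random

def simulate_all_found(N=20, M=2, P=5, trials=100_000, seed=0):
    """Estimate the chance that every item from the first draw
    also appears in the second draw."""
    rng = random.Random(seed)
    items = range(N)
    hits = 0
    for _ in range(trials):
        first = set(rng.sample(items, M))   # first draw, then "replaced"
        second = rng.sample(items, P)       # second draw from all N items
        if first.issubset(second):
            hits += 1
    return hits / trials

estimate = simulate_all_found()
print(estimate, 1 / 19)
```

The estimate should land close to 1/19 (about 0.0526), and increasing the number of trials tightens the agreement.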