Probability of getting items 1-N when selecting from population of 1-M (M>N)

This sounds kind of like homework, but it’s not. Some backstory on how this came up: A friend and I were talking about some questions related to a language puzzle that had been presented to him as a coding problem for an interview.

One question that came out of this was: “How long an alphabetized list of random English words do you need in order to figure out the order of the alphabet?” In case anyone is curious about that question: I wrote a Monte Carlo simulation of it, and you get to 99+% with a list of 2500 words (from the dictionary I was using).

But a sort of related question came to mind, that feels like it ought to have a nice closed form answer, but I can’t figure it out.

Each pair of adjacent words in an alphabetized list gives you at most one partial ordering. For example, (bat, zebra) gives you “b is earlier than z” and (alpha, alpine) gives you “h is earlier than i”. In order to figure out the ordering, you need to end up seeing words that give you the 25 orderings for letters right next to each other: (a,b), (b,c), (c,d), …, out of all 26*25/2 = 325 possible orderings.

So, if we were selecting, not actual words, but just orderings, randomly, how many would we have to select to get, say, a 99% chance that we’ve seen the necessary 25?

Or, more generally, how many selections do I have to make from 1-M (325) to be X% (99) sure that I’ll have gotten all of 1-N (25)?

Based on my simulation and the fact that letters are not randomly distributed in English, I expect the answer to be less than 2500. Maybe around 1000?
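The abstract version of the question is easy to attack by simulation too. Here is a minimal Monte Carlo sketch (my own illustration, not code from the thread; the function names and the 99th-percentile readout are just illustrative):

```python
import random

def draws_until_all_targets(M=325, N=25, rng=random):
    """Draw uniformly from M equally likely orderings, with replacement,
    until all N critical ones have been seen; return the number of draws."""
    seen = set()
    draws = 0
    while len(seen) < N:
        draws += 1
        pick = rng.randrange(M)
        if pick < N:              # treat orderings 0..N-1 as the 25 critical ones
            seen.add(pick)
    return draws

def draws_for_quantile(prob=0.99, runs=4000, seed=1):
    """Empirical number of draws needed for a `prob` chance of having seen all N."""
    rng = random.Random(seed)
    samples = sorted(draws_until_all_targets(rng=rng) for _ in range(runs))
    return samples[int(prob * runs)]

q = draws_for_quantile()
print(q)   # lands somewhere around 2500 draws
```

With 4000 runs the 99th-percentile estimate bounces around by a few dozen draws, but it consistently lands in the mid-2500s, in line with the answers worked out analytically later in the thread.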

For what it’s worth, this is a generalization of the famous coupon collector’s problem; see the Wikipedia article “Coupon collector’s problem”.

This is one of those problems where it’s very easy to figure out the average number of drawings you’ll need, but other stats may be harder. In your N=25, M=325 example the average number needed is
325/25 + 325/24 + 325/23 + … + 325/1 ≈ 1240.2 ≈ 325 * (ln(25) + 0.5772)
This is easy to see. 25/325 is the chance of a successful first drawing, so 325/25 is the average delay until that success. With only 24 desirable draws (instead of 25) after the first success, the 2nd success gets a little less likely, and so on.
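That average is a two-liner to check (a sketch; the harmonic sum is exact, the log form is the standard coupon-collector approximation):

```python
import math

M, K = 325, 25
exact = sum(M / k for k in range(1, K + 1))    # 325/25 + 325/24 + ... + 325/1
approx = M * (math.log(K) + 0.5772156649)      # 325 * (ln 25 + Euler-Mascheroni)
print(round(exact, 1), round(approx, 1))       # 1240.2 vs 1233.7
```

The log form undershoots the exact harmonic sum by roughly M/(2K) = 6.5 here; the agreement tightens as K grows.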

But iamthewalrus(:3= didn’t ask for the average until success; he wants the number of draws that gives a 99% chance. I don’t know if I can figure that out analytically (even if we calculated the variance, the distributions are skewed), but if a very crude guess is good enough, just double the 1240 above to get 2480; that will give roughly a 99% success chance. Multiply by something less than 2 when N is much bigger than 25; multiply by 2.5 or more if N is much less than 25.

Why is 25/325 the chance of a successful first drawing? It seems any first drawing is somewhat “successful”, though they do not obviously each convey the same amount of information.

I consider the abstract problem of 25 distinguishable but equally desirable red balls in an urn of 325 balls, all equally likely to be drawn.

OP said he’d already developed, via simulation, an answer to the more complicated problem, and now wanted a closed-form solution to the simpler abstract problem.

It seems to me that we could approach it this way:
We have 325 balls, 300 of which are green, numbered 1-300 and 25 of which are red, numbered 301-325.
We will draw one ball and then replace it, N times.
For any value of N, there are finitely many possible outcomes; call that number O. For example, with N=1 there are O=325 possible outcomes, 25 of which yield a single red ball, and none of which yield all 25 red balls. Let n be the number of outcomes which yield all 25 of the red balls. The probability of success is then n/O.
Clearly, when N<25, n=0. When N=25, O=325^25 and n=25!. Hence, the probability of success with N=25 is 25!/325^25. We could then calculate n/O for each value of N and find the tipping point where n/O>99%.
The problem seems to be that we don’t have an easy way of looking at something like 25!/325^25 and knowing if it is < or >99%. I suspect it would take quite a bit of computing power just to calculate O for large values of N like 2,000. I know my calculator can’t handle 325^2000.
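For what it’s worth, languages with arbitrary-precision integers sidestep that last obstacle. In Python, for example, 325^2000 is a perfectly ordinary (if 5000-digit) number, and the N=25 probability can be computed exactly (a quick sketch of mine):

```python
from fractions import Fraction
from math import factorial

p25 = Fraction(factorial(25), 325**25)   # exact success probability at N = 25
print(float(p25))                        # about 2.4749e-38
print(len(str(325**2000)))               # 325^2000 has 5024 digits
```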

I don’t have time to do it right now, but you can probably simplify that calculation using Stirling’s Approximation, and thus get it down to something that calculators can handle.

It isn’t hard to solve
a!/b^c > 0.99
for c. Taking logs and using Stirling’s approximation, this becomes approximately
c < (0.93 + (a+0.5) * ln(a) - a) / ln(b)

The problem is that the expression for the success probability, which at N = 25 is
25! / 325^25
rapidly gets more complicated. For N=26 I think it’s
(26! * 300 + 25! * 13 * 25) / 325^26
… not too bad. But N=27 would already be tedious to write down by hand; and we’ll need N ~ 2550 for OP’s problem.
Is there a nice approximation for the number of successful N-samplings? Probably, but it isn’t apparent to me.

Wolfram Alpha

Here it is as a fraction in lowest terms…

Here it is with way more digits calculated than necessary…
2.4749227873675860041365178605855532004407028321391619… × 10^-38

It’s true that the other 300 orderings provide some information, but they don’t in total provide enough information to determine the ordering.

The simplest way to envision this is to imagine you figured out all the partial orderings except “a is followed by b” (which is one of the critical 25). You’ll know the absolute ordering of every letter but the first two; all you can say is that the first two letters are a and b, in an unknown order. 324 out of 325 of the orderings are not sufficient to solve the problem. You need the critical 25.

So, while the other orderings provide some information about the ordering of the alphabet, that information is both not necessary (if you have the critical 25), and insufficient to solve the problem.

Interesting answers from all. I guess I’m a little glad that it’s not as simple as I thought.

For what it’s worth, it looks like the answer (via Monte Carlo simulation) is actually slightly higher than for the actual language problem I set out to simplify. The number of word samples you need is somewhere between 2400 and 2500 (simulation still running), but the number of ordering samples you need is around 2530.

This surprises me. I expected that the non-random arrangement of letters in English would make sampling random words take longer than sampling random orderings of letters (and also, sometimes one adjacent word is a prefix of the other, and you can’t get any info out of it).

Using a recurrence relation, this can be numerically calculated with a spreadsheet or program. When I started using the recurrence to get a closed form expression it looked like it would lead to something that would not work well numerically for a large number of draws. It would have around 25 terms and look something like:

A1*(300/325)^N + A2*(301/325)^N + A3*(302/325)^N + …
where A1, A2, … are constants to be evaluated.

For the recurrence, if p[ n, j ] is the probability of having selected j of the target 25 items after n draws:
p[ n+1, j ] = p[ n, j-1 ] * (26-j)/325 + p[ n, j ] * (300+j)/325
because you can get j target items after n+1 draws if you had j-1 after n draws and got a new one on the next draw, or had j after n draws and selected a value that was either already selected or is not a target.

The probabilities for 2537 and 2538 draws are:
2537 0.989988515035
2538 0.990019174594

…from the Python script:

p = [0.0] * 26
p[0] = 1.0                   # before any draws, certain to have 0 targets
for n in range(1, 2539):
    for j in range(25, 0, -1):   # descending, so p[j-1] is still last round's value
        p[j] = p[j-1] * (26 - j) / 325.0 + p[j] * (300 + j) / 325.0
    p[0] = p[0] * 300.0 / 325.0
    print(n, p[25])

After evaluating the coefficients, it turns out the exact expression does work pretty well numerically. Probability of getting all 25 after N draws is:
P=sum(i=0 to 25) 25! / i! / (25-i)! * (-1)^i * (325-i)^N/ 325^N
Using the approximation (1 - i/325)^N ≈ exp(-iN/325) allows the series to be summed, giving the approximation:
(1 - exp(-N/325))^25
For a given probability, P, this may be solved, to get the approximation:
N = -325* ln(1-P^(1/25))
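Both the exact sum and the approximations are easy to evaluate directly; here is a sketch (function names are mine):

```python
import math

M, K = 325, 25   # population of orderings, number of critical ones

def p_exact(n):
    """Inclusion-exclusion sum: probability that n draws contain all K targets."""
    return sum(math.comb(K, i) * (-1)**i * ((M - i) / M)**n
               for i in range(K + 1))

def p_approx(n):
    """Manlob's summed approximation."""
    return (1.0 - math.exp(-n / M))**K

def n_for(prob):
    """Invert the approximation: draws needed for a given success probability."""
    return -M * math.log(1.0 - prob**(1.0 / K))

print(p_exact(2537), p_exact(2538))   # brackets 0.99, matching the table above
print(round(n_for(0.99)))             # the approximation gives about 2541
```

The exact sum confirms the recurrence’s 2537/2538 crossover, and the inverted approximation lands within a few draws of it.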

Brilliant! Working backwards from here, I see there is another way to proceed.

The probability of drawing all 25 balls (Y_1, Y_2, Y_3, … Y_25) out of the urn after N drawings is the probability (call it g(N)) of drawing Y_1 multiplied by the probability of drawing Y_2 (also g(N)) multiplied by the probability of Y_3 and so on. This is not quite true—the events are not independent—but close enough with a denominator as large as 325. So OP is just asking to find the N that gives g(N)^25 = 0.99. But what is g(N) ? Or, easier, what is (1 - g(N)), the probability that we never encounter a given ball in N drawings?

(1-g(N)) is just (324/325)^N, or approximately e^(-N/325). Rearrange terms and get the same excellent approximation Manlob presented.

Manlob of course! Thank you so much for the analytical solution.

If I have understood this correctly, the probability being computed is exactly N!/325^N times the coefficient of z^N in (e^z - 1)^25 * e^(300z)?

Very elegant, DPRK !

Manlob’s approximate probability (1 - e^(-N/325))^25 will often be more practical, I think.
I wonder what the limitations are on the approximation. It helps that 325 is largish, but I think the formula’s result is in the “ballpark” even for much smaller numbers.
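Expanding (e^z - 1)^25 by the binomial theorem shows DPRK’s coefficient expression is term-for-term the same as Manlob’s exact sum. A quick check with exact rational arithmetic (a sketch; the function name is mine), against the hand counts for N = 25 and N = 26 earlier in the thread:

```python
from fractions import Fraction
from math import comb, factorial

def p_egf(n, M=325, K=25):
    """n!/M^n times [z^n] (e^z - 1)^K * e^((M-K)z), via the binomial expansion."""
    coeff = sum(Fraction(comb(K, i) * (-1)**i * (M - i)**n, factorial(n))
                for i in range(K + 1))
    return factorial(n) * coeff / Fraction(M)**n

# agrees exactly with the counts worked out by hand earlier in the thread
assert p_egf(25) == Fraction(factorial(25), 325**25)
assert p_egf(26) == Fraction(factorial(26) * 300 + factorial(25) * 13 * 25, 325**26)
print("coefficient formula matches")
```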

I feel I may be missing the point, but how would that simple equation work for the strict case of the alphabet? Are the odds of determining the placement of an x the same as the odds of determining the placement of an e? Shouldn’t the frequency of the letters in existing words play a significant role?

Sorry for the multi-post. I just reread the original post and saw that not actual words were used, but random orderings. I withdraw my earlier questions.

What was the job your friend interviewed for?

I’m also interested in the variety of jobs for which that could be a valid question from an (annoyingly ball-busting?) HR interviewer.