This sounds kind of like homework, but it’s not. Some backstory on how this came up: A friend and I were talking about some questions related to a language puzzle that had been presented to him as a coding problem for an interview.

One question that came out of this was: “How long an alphabetized list of random english words do you need to have to figure out the order of the alphabet?”. In case anyone is curious about that question: I wrote a Monte Carlo simulation of this and you get to 99+% with a list of 2500 words (from the dictionary I was using)

But a sort of related question came to mind, that feels like it ought to have a nice closed form answer, but I can’t figure it out.

Each pair of adjacent words in an alphabetization list gives you at most one partial ordering. For example, (bat, zebra) gives you “b is earlier than z” and (alpha, alpine) gives you “h is earlier than i”. In order to figure out the ordering, you need to end up looking at words that give you the 25 orderings for letters right next to each other: (a,b), (b,c), (c,d)…, out of all possible 26*25/2 = 325 orderings.

So, if we were selecting, not actual words, but just orderings, randomly, how many should we have to select to get, say, a 99% chance that we’ve

Or, more generally, how many selections do I have to make from 1-M (325) to be X% (99) sure that I’ll have gotten (1-N) (25)?

Based on my simulation and the fact that letters are not randomly distributed in English, I expect the answer to be less than 2500. Maybe around 1000?