Probability Question (binomial distribution+)

Ok, apparently this is technically a hypergeometric distribution (not binomial), but whatever.
You’re running a tournament wherein 72 competitors will be broken into three 24-person sections for the preliminary rounds. Competitors will be assigned to their prelim sections randomly. You’ve identified 12 competitors whom you consider to be “elite,” and in the interest of fairness you want to know the probability that none of the sections will have an inordinately large or small number of these elite performers. We’ll say, arbitrarily, that any number from 2 to 6 (inclusive) is acceptable.

Getting in the ballpark is simple enough. Just plug these values into a hypergeometric distribution calculator and we see that a given section has a 4.95% chance of ending up with more than six elites and a 3.98% chance of ending up with fewer than two, and so has a 91.07% chance to fall into the acceptable range. Then to estimate the probability that every section will fall within the acceptable range you can just do .9107^3 ≈ 75.5%.
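For anyone without a hypergeometric calculator handy, here is a minimal Python sketch of the same ballpark numbers. The function and variable names are my own, and it assumes exactly the setup described: 72 competitors, 12 elites, 24 per section.

```python
from math import comb

def hypergeom_pmf(k, population=72, elites=12, section=24):
    """P(exactly k of the elites land in one given 24-person section)."""
    return (comb(elites, k) * comb(population - elites, section - k)
            / comb(population, section))

# Tail probabilities for a single section
p_low = sum(hypergeom_pmf(k) for k in range(0, 2))    # fewer than two elites
p_high = sum(hypergeom_pmf(k) for k in range(7, 13))  # more than six elites
p_ok = 1 - p_low - p_high                             # two to six, inclusive

# Naive estimate that treats the three sections as independent (they are not)
p_all_naive = p_ok ** 3

print(f"too few: {p_low:.4f}  too many: {p_high:.4f}  ok: {p_ok:.4f}")
print(f"naive all-three estimate: {p_all_naive:.3f}")
```

The last figure is only the rough cube-it estimate, since the three section counts are not actually independent.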

The problem, of course, is that the distribution of elites in one section will affect the probabilities for the other sections, and I can’t figure out how to account for that. (I do have to account for it, right? The possible distortions don’t just average out?)

Could anyone figure out the overall probability and show me how to get there? I’d appreciate it very much. Thanks.

Personally, I’d just brute-force Monte Carlo it.

Yeah, that would work. Never tried it before, though: is there a simple way to set that up? (I’ll try to find relevant instructions for building one in Excel, but I’d accept any help from someone with more experience than I.)
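Not Excel, but if Python is an option, a brute-force simulation could look roughly like the sketch below. The names (`simulate`, `trials`, `seed`) are just illustrative, and it assumes that seats 0-23, 24-47 and 48-71 stand for the three sections, with the 12 elites dropped into 12 distinct seats at random.

```python
import random

def simulate(trials=50_000, seed=0, low=2, high=6):
    """Estimate P(every section gets between low and high elites)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    ok = 0
    for _ in range(trials):
        # Pick 12 distinct seats out of 72 for the elites
        seats = rng.sample(range(72), 12)
        counts = [0, 0, 0]
        for seat in seats:
            counts[seat // 24] += 1  # seats 0-23, 24-47, 48-71
        if all(low <= c <= high for c in counts):
            ok += 1
    return ok / trials

estimate = simulate()
print(f"estimated probability: {estimate:.3f}")
```

With 50,000 trials the sampling error is on the order of a couple of tenths of a percentage point, so the estimate is good for a sanity check but not for the third decimal place.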

Simplest way to an exact solution is to note that (6,4,2), (6,3,3), (5,5,2), (5,4,3), and (4,4,4) are the only acceptable (unordered) ways to distribute the 12 elites.

The ordering of the elites can be taken as fixed. The probability of a (6,4,2) distribution is C(24,6)*C(24,4)*C(24,2)/C(72,12) ~= .02569. This must be multiplied by 6 since there are six ways to order (6,4,2).

Calculate the other four probabilities similarly, multiplying by 3, 3, 6, 1 respectively instead of 6; then sum these five probabilities to get approx. .794.
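The recipe above can be sketched in a few lines of Python. Variable names are mine; instead of multiplying each unordered split by 6, 3, 3, 6 or 1, this version simply loops over every ordered split (a, b, c), which comes to the same thing.

```python
from math import comb

total = comb(72, 12)  # all ways to choose which 12 of the 72 slots hold elites

# The single ordered (6,4,2) case, for comparison with the figure quoted above
p_642 = comb(24, 6) * comb(24, 4) * comb(24, 2) / total

# Enumerate every ordered split (a, b, c) with each count from 2 to 6
acceptable = []
for a in range(2, 7):
    for b in range(2, 7):
        c = 12 - a - b
        if 2 <= c <= 6:
            acceptable.append((a, b, c))

favorable = sum(comb(24, a) * comb(24, b) * comb(24, c)
                for a, b, c in acceptable)
p_all_ok = favorable / total

print(f"(6,4,2) in that order: {p_642:.5f}")
print(f"ordered splits counted: {len(acceptable)}")
print(f"all sections acceptable: {p_all_ok:.4f}")
```

The loop finds 19 ordered splits, which matches 6 + 3 + 3 + 6 + 1 from the unordered list.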

(Never mind why you don’t just “seed” the tournament, putting 4 elites in each section. :smiley: )

Why isn’t this just a textbook example of the multinomial distribution?

Hey, that’s great, **septimus** -- never occurred to me to tackle it from that direction. Thanks!

If I could impose upon you one more time, I wonder if you could briefly explain what you’re doing here:

Sorry, I never bothered taking math courses after high school, so there are some gaps in my knowledge.

Oh, that it were so simple! I’d be happy to give the details of the real-world situation if anyone’s interested, but to keep the OP to a reasonable length I had to cut the fat.

Does that work to get the specific information I’m after? I looked at it before posting the OP but didn’t see how it would. I’m very open to the idea that I missed something, though.

Unless there’s some subtle aspect of the problem that I’m missing, the multinomial distribution solves your problem exactly. You do have to go through like septimus did and figure out which outcomes you’re looking for, but after that it’s just plug and chug.
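For what it's worth, a quick check suggests the plain multinomial formula is only an approximation here. It assumes each elite is dropped into a section independently with probability 1/3, which ignores the fact that every section is capped at 24 people; the slot-counting method earlier in the thread is the fixed-size version. A rough Python sketch comparing the two (function names are my own):

```python
from math import comb, factorial

def ordered_splits(low=2, high=6, elites=12):
    """All ordered (a, b, c) with a + b + c = elites and each in [low, high]."""
    return [(a, b, elites - a - b)
            for a in range(low, high + 1)
            for b in range(low, high + 1)
            if low <= elites - a - b <= high]

def multinomial_prob():
    # Each elite independently lands in one of three sections, p = 1/3 each
    ways = sum(factorial(12) // (factorial(a) * factorial(b) * factorial(c))
               for a, b, c in ordered_splits())
    return ways / 3 ** 12

def fixed_size_prob():
    # Sections fixed at 24 seats each (the counting argument from the thread)
    ways = sum(comb(24, a) * comb(24, b) * comb(24, c)
               for a, b, c in ordered_splits())
    return ways / comb(72, 12)

print(f"multinomial (independent draws): {multinomial_prob():.4f}")
print(f"fixed 24-person sections:        {fixed_size_prob():.4f}")
```

The gap makes intuitive sense: with fixed section sizes, every elite placed in one section leaves one fewer open seat there, which pulls the counts toward an even split.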

Ok, now I am curious. Spill it.
PS. Would the groupings follow a standard distribution?

I believe you, but I’m not seeing it. Like I said: gaps in my knowledge.

Longer than necessary explanation:

I’m a coach for my old high school’s speech & debate team. Last weekend we were at the state championships, where there was a bit of a kerfuffle.

In the Congressional Debate category, there were 68 speakers divided (randomly) into three chambers. The top 8 in each chamber advanced to the semifinals (two chambers of 12), then the top half from the semis advanced to the 12-person finals (or “super session”). In the past, the placement of the top finishers has been determined solely by the ranks in the super session; that is, after the semifinals the slate was wiped clean, and each remaining competitor had (nominally) an equal chance to win.

This year, however, they decided to score the event cumulatively*, adding up all the ranks from every judge for every competitor and determining placement based on the straight total. There were extra judges in the super session to increase its weight, but now that last chamber would only count for ~36% of a finalist’s score instead of 100%.

The benefit of scoring it this way, of course, is the increased sample size. The downside is that you run the risk of putting all the best speakers in the same prelim chamber to beat each other up and split the highest ranks, thereby putting them at a relative disadvantage when it comes time to tally up all the scores (the prelim ranks would account for ~43% of a finalist’s overall score, so plenty large enough to hurt you if it’s a really tough draw).

To some extent, this appears to be what happened this year. On the morning of the first day, there was a surprising amount of unpleasantness between the tournament directors and some of the coaches/parents, since Chamber B seemed to be stacked something awful to anyone with a passing familiarity with the competitors. By my subjective tally, I did indeed find 12 “elite” speakers – those who’ve had high finishes at major national tournaments and/or dominated their own difficult leagues, and whom I would say should *expect* to be in the running for 1st place – and 8 of them were in B. (In the end, 5 of the top 8 finishers came from that one chamber, but none higher than 3rd; you can interpret this any number of ways.)

Now, normally this would have blown over once the tournament was over and everyone got back home. Unfortunately, there was a major tabulation error: in the spreadsheet, the tab room accidentally entered the wrong value in the cell for semifinals results for one competitor, a 28 instead of a 6 (lower is better). This speaker should have won the event, but because of the error he was awarded 7th. Oops! Hue & cry, angry emails, etc.

Now, the tab error is only related to the cumulative scoring controversy in the most insignificant way. However, in the mass email shamefully copping to the mistake, the tournament directors obliquely referenced the pre-tourney controversy and then said the tab error makes it imperative to “revisit the rules of Student Congress,” which I take to mean that they’ll probably kill cumulative scoring just for the sake of making a change so they can be seen to be doing something about the problem.

Wow, that was a lot of context.

So, in this email, the directors asked for people’s input on possible changes, and I intend to write them a short note arguing in favor of cumulative scoring. As part of my argument, I wanted to be able to point to some specific numbers showing that the packed-chamber phenomenon would rarely be an issue at all and so is more than outweighed by the sample-size increase and other minor issues. (And, yes, I also intend to point out that they could preempt the problem with some partial, common sense seeding based on objective criteria.)

… and then I spent most of my afternoon trying to refine 75.5% by a few percentage points. You know, priorities.

* Brag: In 1997, I would have been state champion in this category had they scored cumulatively. Instead, they were still wiping the slate back then, and I finished 7th. :frowning:

Yes? How do you mean?

You understand how the number of experts in the first section would be binomial if there were only two sections, right? The multinomial is just the generalization of that logic to more than two categories.

Right, theoretically that’s no problem, but I don’t understand how to actually perform the calculation in this case.

I assume you understand and have a way to calculate expressions like C(24,6) and just want to understand how I got the large expression.

There are C(24,6) ways that 6 slots (for 6 elites) can be chosen in an ordered list of 24 (18+6) people. Similarly for C(24,4) and C(24,2), so C(24,6)*C(24,4)*C(24,2) is the number of ways to assign all 12 slots with a (6,4,2) distribution.

Divide this by C(72,12), which is the total number of all assignments. Multiply by 6 to account for five other distributions with the same probability as (6,4,2): (6,2,4), (4,6,2), (4,2,6), (2,6,4), (2,4,6). This 6 (which is 3!/(1!1!1!)) could also be called a “multinomial” expression I guess.

The probabilities we’re calculating are for exclusive events, so we just sum them.

It might seem that our counts need to be multiplied by (12! * 60!) to account for the different ways to order the 12 elites, and to order the 60 non-elites. But this huge number would appear in both numerators and denominators, so it just cancels out.

(Disclaimer: I have little formal background in statistics and, ignoring signal-processing seminars and the like, last had formal math training during the Nixon Administration. Hence I address problems like this “from first principles.” This strikes me as a straightforward and reliable path. YMMV.)