Multivariate hypergeometric distribution

erislover · May 20, 2011, 5:29pm

I have a problem I’m trying to grasp, and it seems like the correct model is to use the hypergeometric distribution. I based what I’m doing on wikipedia’s page.

Here’s an approximate problem, which my tests indicate I’ve solved:

If you can calculate (hyper wins winning-pool draws sample-pool) then this is pretty easy to calculate. Missing 0 colors I’ve calculated as 16919961208577417870966/18791835269086797933015 or about 90.04%. (This is in line with my intuition, given the expected number of balls drawn of a color, which is about 2 +/-1, according to wikipedia’s E(X) and variance).

But for some reason, I am completely failing to understand how to calculate the answer I’m actually looking for, which is

My intuition on constructing this problem is failing very hard, because I end up with ridiculous answers: since the problems are so close, it seems the answers should also be close, but I get numbers like 20% or over 100%… :smack:

I’m not looking for the answer, but anything from a nudge to total hand-holding in how to set the problem up.

Thanks for any help.

ultrafilter · May 20, 2011, 8:14pm

If you write out the exact answer to the first problem in terms of binomial coefficients, you should just be able to replace one of the 19s on top with a 17 (and change the bottom to 169) to get the answer to the second problem. It’s a lot of bookkeeping, but nothing too conceptually hard.

Omphaloskeptic · May 20, 2011, 8:16pm

erislover:

I have a problem I’m trying to grasp, and it seems like the correct model is to use the hypergeometric distribution. I based what I’m doing on wikipedia’s page.

Here’s an approximate problem, which my tests indicate I’ve solved:

Suppose you have 9 sets of colored balls, 19 of each, and throw all 171 into an urn. After godlike mixing, you randomly select 19 balls. What are the chances that you have at least one of every color?

If you can calculate (hyper wins winning-pool draws sample-pool) then this is pretty easy to calculate. Missing 0 colors I’ve calculated as 16919961208577417870966/18791835269086797933015 or about 90.04%. (This is in line with my intuition, given the expected number of balls drawn of a color, which is about 2 +/-1, according to wikipedia’s E(X) and variance).

You don’t show your work, so I may be misinterpreting your result, but I don’t think this is right. Let’s start with a simpler case: N=171 balls, with m=19 white balls and 19*8=152 black balls. The (one-dimensional) hypergeometric function gives the probability of drawing k white balls in n draws-without-replacement [notation is that of Wiki’s page on the hypergeometric distribution]. In particular, the probability of drawing k=0 white balls in n=19 draws is C(152,19)/C(171,19), roughly 9.317%. [Here C(n,r) is the binomial coefficient n choose r.]

So you’ve got about a 90.68% chance of drawing at least one white ball. If all of the probabilities were independent (they aren’t, but this will give a reasonable approximation of the answer you should expect), the probability of drawing at least one of each of 9 colors would then be (90.68%)[sup]9[/sup]=41.5%.

I suggest looking back at the first problem again; maybe post again showing your work.

Buck_Godot · May 20, 2011, 8:36pm

Can you use a computer to solve this?

If so

Start off with the fact that each group must contain at least 1 ball, that leaves you with 10 balls to place among the sets.

Write a computer program to enumerate all of the different ways you can divide these balls among the 9 sets. i.e. all ways to write (n1,n2,n3,…n9) such that Sum(n1,n2,n3,…n9)=9.

Then for each example calculate the hypergeometric probability for
(n1+1,n2+1,n3+1…n9+1) vs (19,19,19,…,17)

using the notation <n,k> for n choose k this becomes

<19,n1+1><19,n2+1>…<19,n9+1>/<169,19>

Finally add up all of those probabilities.

There may be some short cuts that allow you to do this by hand, but if so I’m not seeing them right off hand.

erislover · May 20, 2011, 8:50pm

Well, my “work” is more like “a program”


(define (hypergeo wins winning-pool samples total-pool)
  (/ (* (choose-from winning-pool wins)
        (choose-from (- total-pool winning-pool) (- samples wins)))
     (choose-from total-pool samples)))

Then the multivariate case calculates
(choose win pool) * (choose win pool) * … * remainder / (choose draws total pool)
Which also seems straightforward, though it may not read well:


(define (multi-hypergeo list-of-win/pool samples total-pool)
  (let ((remainder-pool 
         (- total-pool (apply + (map second list-of-win/pool))))
        (remainder-draws 
         (- samples (apply + (map first list-of-win/pool)))))
    (let ((remainder (list remainder-pool remainder-draws)))
      (/ (* (apply choose-from remainder)
            (apply * (map (λ(pair) (apply choose pair)) list-of-win/pool)))
         (choose-from total-pool samples)))))

These procedures correctly calculate the given examples on wikipedia, an admittedly limited sampling.

To calculate the chances of missing 0 colors, I subtract from 1 the probability of missing exactly 1 color + exactly 2 colors + … + exactly 8 colors.


(define (approx-miss n)
  (if (zero? n)
      (- 1 (apply + (map approx-miss (list-from-to 1 8))))
      (multi-hypergeo (list-n n '(0 19)) 19 171)))

The '(0 19) is missing exactly one color. Multiple listings of '(0 19) would be missing multiple colors.

This seems correct in the sense that
(approx-miss-n 8) ; select all the same color
Is the exact same value as
(hypergeo 19 19 19 171) ; select all the same color

I agree with this calculation exactly. I must be assuming an erroneous symmetry or leaving out some results.

Omphaloskeptic · May 20, 2011, 9:19pm

erislover:

To calculate the chances of missing 0 colors, I subtract from 1 the probability of missing exactly 1 color + exactly 2 colors + … + exactly 8 colors.
(define (approx-miss n)
  (if (zero? n)
      (- 1 (apply + (map approx-miss (list-from-to 1 8))))
      (multi-hypergeo (list-n n '(0 19)) 19 171)))
The '(0 19) is missing exactly one color. Multiple listings of '(0 19) would be missing multiple colors.

Ah, OK. But here you are (I think; I don’t have time right now to check your code) actually computing the probabilities of missing at least one given color, two given colors, etc. You have to consider the probability that you have no white marbles (but marbles of all eight other colors), plus the probability that you have no red marbles (but marbles of all eight other colors), and so on. Then you have to compensate for overcounting, e.g. with Inclusion-Exclusion.

Instead, the most straightforward way to do this, since you have a computer, is probably Buck Godot’s approach, of finding all ordered partitions of 19 into 9 positive terms. [You should find 43758 distinct partitions.] If you didn’t have a computer, you could avoid wearing out your fingers and pencils by computing the unordered partitions instead–there are only 41 of these–but then you have to compute the degeneracy of each partition (the number of ordered partitions corresponding to this unordered partition) as well. Here they are (spoilered for space), in case you want to take this route and check your work:[spoiler]


   9  (1, 1, 1, 1, 1, 1, 1, 1, 11)
  72  (1, 1, 1, 1, 1, 1, 1, 2, 10)
  72  (1, 1, 1, 1, 1, 1, 1, 3, 9)
  72  (1, 1, 1, 1, 1, 1, 1, 4, 8)
  72  (1, 1, 1, 1, 1, 1, 1, 5, 7)
  36  (1, 1, 1, 1, 1, 1, 1, 6, 6)
 252  (1, 1, 1, 1, 1, 1, 2, 2, 9)
 504  (1, 1, 1, 1, 1, 1, 2, 3, 8)
 504  (1, 1, 1, 1, 1, 1, 2, 4, 7)
 504  (1, 1, 1, 1, 1, 1, 2, 5, 6)
 252  (1, 1, 1, 1, 1, 1, 3, 3, 7)
 504  (1, 1, 1, 1, 1, 1, 3, 4, 6)
 252  (1, 1, 1, 1, 1, 1, 3, 5, 5)
 252  (1, 1, 1, 1, 1, 1, 4, 4, 5)
 504  (1, 1, 1, 1, 1, 2, 2, 2, 8)
1512  (1, 1, 1, 1, 1, 2, 2, 3, 7)
1512  (1, 1, 1, 1, 1, 2, 2, 4, 6)
 756  (1, 1, 1, 1, 1, 2, 2, 5, 5)
1512  (1, 1, 1, 1, 1, 2, 3, 3, 6)
3024  (1, 1, 1, 1, 1, 2, 3, 4, 5)
 504  (1, 1, 1, 1, 1, 2, 4, 4, 4)
 504  (1, 1, 1, 1, 1, 3, 3, 3, 5)
 756  (1, 1, 1, 1, 1, 3, 3, 4, 4)
 630  (1, 1, 1, 1, 2, 2, 2, 2, 7)
2520  (1, 1, 1, 1, 2, 2, 2, 3, 6)
2520  (1, 1, 1, 1, 2, 2, 2, 4, 5)
3780  (1, 1, 1, 1, 2, 2, 3, 3, 5)
3780  (1, 1, 1, 1, 2, 2, 3, 4, 4)
2520  (1, 1, 1, 1, 2, 3, 3, 3, 4)
 126  (1, 1, 1, 1, 3, 3, 3, 3, 3)
 504  (1, 1, 1, 2, 2, 2, 2, 2, 6)
2520  (1, 1, 1, 2, 2, 2, 2, 3, 5)
1260  (1, 1, 1, 2, 2, 2, 2, 4, 4)
5040  (1, 1, 1, 2, 2, 2, 3, 3, 4)
1260  (1, 1, 1, 2, 2, 3, 3, 3, 3)
 252  (1, 1, 2, 2, 2, 2, 2, 2, 5)
1512  (1, 1, 2, 2, 2, 2, 2, 3, 4)
1260  (1, 1, 2, 2, 2, 2, 3, 3, 3)
  72  (1, 2, 2, 2, 2, 2, 2, 2, 4)
 252  (1, 2, 2, 2, 2, 2, 2, 3, 3)
   9  (2, 2, 2, 2, 2, 2, 2, 2, 3)

[/spoiler]

Lance_Turbo · May 20, 2011, 11:38pm

The answer for the first problem is 0.362796 and the answer for the second is 0.362022. Both answers are approximate but I can give exact if needed.

I computed this by considering all ordered partitions of 19 of nine or less parts. The ones with less than nine parts represent the bad cases and were padded with zeroes. This is pretty much what Omphaloskeptic was talking about.

However, each of these ordered partitions are not equally likely. So I replaced each part of each partition with Binomial[19,p] (or Binomial[17,p] as needed) and multiplied the parts together. This gives the number of ways to choose each ordered partition.

Then it’s just a matter of summing and dividing.

Finally I wrote a simulation and the results supported my answer.

The whole thing including the sim is about ten lines of Mathematica which I can share upon request.

Lance_Turbo · May 21, 2011, 2:32am

I decided to write a function in Mathematica to compute the general case where I’m picking p balls from a bin with c1 balls of color one, c2 balls of color 2, etc. I was quite pleased with result so I thought I’d share it.


Prob[picks_, bins_] := 
 Plus@@(Times@@Table[Binomial[bins[li], #[*]], [/li] {i, 1, Length[bins]}]&/@(Flatten[Permutations/@IntegerPartitions[picks, 
 {Length[bins]}], 1]))/Binomial[Plus@@bins, picks]

Calling it with parameters:


Prob[19, {19, 19, 19, 19, 19, 19, 19, 19, 19}]

and


Prob[19, {19, 19, 19, 19, 19, 19, 19, 19, 17}]

Gives the answers to the first and second problems in the OP.

Mathematica’s IntegerPartitions function does a lot of the heavy lifting here, but the rest of it should be a pretty straightforward translation into any other language.

erislover · May 21, 2011, 6:05pm

Thanks for your help, guys. I didn’t understand what you meant by partition at first until the spoilerized list. (I mean, it wasn’t obvious how it applied.) I’ve never heard this word “degeneracy” used that way, and I can’t seem to dig up a reference for counting this. Looks like it is
length! / (repeats! * … * repeats!)
e.g., the degeneracy of the partition of 19 (1, 1, 1, 2, 2, 2, 3, 3, 4) is
9! / ( 3! 3! 2! ) = 5040

It’s not obvious to me why this is the number, or how I could have known that I needed this number, but my understanding of probability and combinatorics is absolutely dismal. Omphaloskeptic, is there some reference lurking around for this degeneracy factor? The one reference I could dig up says only, “…which is clearly a classical degeneracy factor.” Clearly :o

Generating the partitions seems easily amendable to brute force and time, but much faster than generating all permutations of those partitions, so this counting method is good. I’ll continue plugging away at it. Thanks for your help.

Lance, could you tell me what would happen if there were 8 colors instead of 9 (19 balls of each color = 152 balls)? I’m looking to physically realize the 9 color situation and need to advise against it, given these odds, but I have already physically realized the 8 color situation. In my three tests (wow, three whole tests) one of each color was indeed present. I’m interested to know how much of a change there is. (Also, if I can successfully calculate it, it’d be nice to have to answers to check. I’m an empirical mathematician… )

Lance_Turbo · May 21, 2011, 7:55pm

OK, I think I have come up with a way to do this that is easily implementable without resorting to Mathematica’s more involved built in functions.

First, yesterday I used the the term ordered partition where I should have used the word composition. That was my mistake.

A composition of p in b parts is a sequence (order matters) of b integers (greater than zero) that add up to n. For example there are two compositions of 3 with two parts (1,2) and (2,1).

In your problem you are picking a total of p balls which are each one of b colors. Each distinct composition of p corresponds to a term in a large sum of hypergeometric probabilities.

In order to solve your problem you would like to generate all compositions of p with exactly b parts. Fortunately, there is a bijection between compositions of p with b parts and subsets of set with p-1 elements of cardinality b-1.

I’ll sketch a proof of that. Consider a list of p 1’s. There are p-1 gaps in between them. If I place b-1 commas in those gaps and p-b+1 pluses, then after performing the addition what remains is a composition of p into b parts. Finally it is well known p-1 choose b-1 is the number of cardinality b-1 subsets of a set with p-1 elements.

E.g. Consider a 2 element subsets of {1,2,3,4} say {1,3}. Place commas in the first and third gap in a list of 5 1’s and pluses elsewhere to get a 3 part composition of 5.

{1,3} -> 1,1+1,1+1 -> (1,2,2)

OK, so now if we can generate all the subsets of a set with p-1 elements that have cardinality b-1 we generate all the compositions we need. But this is super easy to do with b-1 nested For loops.

E.g. For[(i,1,3), (j,i+1,4), {i,j}] yields (in no particular programming language):
{1,2}
{1,3}
{1,4}
{2,3}
{2,4}
{3,4}

Which are all the 2-subsets of {1,2,3,4}

These can be converted into the 3 part compositions of 5 using the following trick:

{a,b} -> (a,b,5) - (0,a,b) = (a, b-a, 5-b) (This extends naturally to larger sets in the obvious way)
e.g
{1,3} -> (1,3,5)-(0,1,3) = (1,2,2) as above.

OK, now suppose we have 6 red balls, 5 blue balls and 3 yellow balls, we will pick 5 of them at random, and wish to know the probability of having one of each color.

Our 2-subsets of {1,2,3,4} are converted to these 3 part compositions of 5:

{1,2} -> (1,1,3)
{1,3} -> (1,2,2)
{1,4} -> (1,3,1)
{2,3} -> (2,1,2)
{2,4} -> (2,2,1)
{3,4} -> (3,1,1)

Now each composition corresponds to the hypergeometric probability of choosing that composition.

e.g. (1,3,1) -> (6,1)(5,3)(3,1)/(14,5)

I’m using (a,b) here to mean the a choose b. The number on the left of each pair is the number of balls of that color and the number on the right is the corresponding number from the composition. The denominator is self explanatory.

Add all those bad boys together and you get your answer which for this example will be 15/22.

I could go into more detail but this post is too long already and I feel like there is enough info here that you should be able to extend it your problem or any similar problem without too much trouble. Of course, I’ll probably be around if you have any questions.

Lance_Turbo · May 22, 2011, 2:33am

A question I have for you occured to me. All your 19’s, 17’s, 9’s and 8’s seem curiously specific. Did this problem arise in a natural way?

erislover · May 22, 2011, 6:47pm

On the degeneracy issue:

From Multinomial theorem - Wikipedia

Lance, it’s going to take me a bit to digest that. But to answer your question, in reality I am attempting to align fiber optic cables. I have a large bundle gathered from 9 sources. I can conveniently gather them in a bundle of 169 fibers (8 * 19 + 1 * 17)[sup]*[/sup], but what is going to look at them is, inconveniently, only something capable of seeing about 19 of them. Numbers like 19 are what you get when you honeycomb pack circles in circles. (1, 7, 19, …) As you can see, the colored balls are a perfect analogy. Too bad the ease of analogy did not correspond to ease of calculation!

[sub][sup]That is, I already have something for this purpose in a different application. It’d be very nice if it could be used here, but it looks like the answer is “no.”[/sup][/sub]

erislover · May 23, 2011, 6:19pm

Just wanted to follow-up here.

I’ve successfully managed to calculate the 9*19 case (171 balls) using unordered partitions and the multinomial coefficient factor, yielding
6149208111577038261331/16949498477999856959190, or 36.28%.

Lance, I like this idea of generating compositions for the more general case where all the color-pools aren’t the same size. I haven’t given much thought to extending this unordered partition method to the different-sized pools. Seems straightforward, but I’ve already showed my ass on things I considered straightforward.

Thanks so much, everyone. It was a very educational diversion.

Lance_Turbo · May 23, 2011, 7:56pm

It might be straightforward but it won’t give you the correct answer. That method requires that the probability of each partition is homogeneous with respect to ordering. This is not the case when the color pools are not equal.

erislover · May 23, 2011, 9:04pm

This is why the composition idea is powerful. Otherwise I’m back to my original problem in the OP, even though I’ve learned much on the way.

Lance_Turbo · May 24, 2011, 5:28am

I feel like I must not have explained the composition method very well or you would have just done it.

Here’s how one would generate the answer with 8 nested For loops. The code is in Mathematica, but it should be easily translatable.



numerator = 0;

For[i1 = 1, i1 < 12, i1++,
 For[i2 = i1 + 1, i2 < 13, i2++,
  For[i3 = i2 + 1, i3 < 14, i3++,
   For[i4 = i3 + 1, i4 < 15, i4++,
    For[i5 = i4 + 1, i5 < 16, i5++,
     For[i6 = i5 + 1, i6 < 17, i6++,
      For[i7 = i6 + 1, i7 < 18, i7++,
       For[i8 = i7 + 1, i8 < 19, i8++,
        numerator +=
          Binomial[19, i1]*
          Binomial[19, i2 - i1]*
          Binomial[19, i3 - i2]*
          Binomial[19, i4 - i3]*
          Binomial[19, i5 - i4]*
          Binomial[19, i6 - i5]*
          Binomial[19, i7 - i6]*
          Binomial[19, i8 - i7]*
          Binomial[17, 19 - i8]
        ]
       ]
      ]
     ]
    ]
   ]
  ]
 ]

numerator

denominator = Binomial[169, 19]

numerator/denominator // N

This outputs the numerator of 2223721194012220900416280, the denominator of 6142498248427148162010456, and their ratio 0.362022.

erislover · May 26, 2011, 7:01pm

Hi Lance. I did indeed “just do it,” after my previous post, I just didn’t mention it. Thank you very much for your help.

I’m not quite sure yet how to solve the more general problem using this method. I mean, the n-colors with minimum successes problem. (In my personal use of probability, I normally look for “n or more wins” rather than “exactly n wins.”) Generating all these permutations smells a lot like an edge case to me, rather than a sound general solution.

I had the suspicion that one only needed to extend the partitions to cover the case of the 17 being in each position, and then to use the degeneracy factor to cover the remaining indistinguishability in the counts. That is, specifically, for the partition
(8 + 2 + 2 + 2 + 1 + … + 1)
Which normally has a degeneracy factor of 504, one need only expand this single partition to 9 duplicates, where the 17-pool is associated with each term, and use the degeneracy of the subsequent partitions as a factor, rather than generating all permutations of the terms.

Even if this is sound, it still may be slower than just looping up permutations, because I am not a great programmer, either. Still… the permutation-style calculation has a big weakness when we solve the more general problem, like, “At least 1 of six of the colors,” or some such. The permutation space grows very fast. I mean, it’s essentially going to be an O(n!) average calculation—which is horrible when O(n!) should be worst-case (because we really do have to look at almost all permutations).

But, this is all idle chat, as the original question has been answered, and I’ve been able to follow what’s going on to the point of being able to generate the answer myself, thanks to everyone’s patience and help.

Omphaloskeptic · May 30, 2011, 4:59pm

erislover:

Thanks for your help, guys. I didn’t understand what you meant by partition at first until the spoilerized list. (I mean, it wasn’t obvious how it applied.) I’ve never heard this word “degeneracy” used that way, and I can’t seem to dig up a reference for counting this. Looks like it is
length! / (repeats! * … * repeats!)
e.g., the degeneracy of the partition of 19 (1, 1, 1, 2, 2, 2, 3, 3, 4) is
9! / ( 3! 3! 2! ) = 5040

It’s not obvious to me why this is the number, or how I could have known that I needed this number, but my understanding of probability and combinatorics is absolutely dismal. Omphaloskeptic, is there some reference lurking around for this degeneracy factor? The one reference I could dig up says only, “…which is clearly a classical degeneracy factor.” Clearly :o

Sorry I didn’t get back to you on this. It looks like you figured it out on your own, though—yes, this is a multinomial coefficient, giving the number of ways we can rearrange this partition into the 9 color bins.

As for the “degeneracy factor” terminology: I don’t know if that’s a common name for this factor, and at any rate I don’t have a reference in mind showing that usage. I’ve also seen similar factors called “multiplicity” or “symmetry” factors (though again I don’t have a reference off the top of my head). Aside: In linear algebra, one often talks about degenerate eigenvalues; in quantum mechanics, about degenerate energy levels (which are also eigenvalues). In quantum mechanics especially, this degeneracy is usually due to some symmetry of the system, so there’s a strong connection between this “degeneracy” and symmetry.

Here the idea is that we are trying to sum the values of some function over all valid compositions (which I was calling “ordered partitions” earlier); but the value of the function doesn’t depend on the order of the terms in the composition, so we can reduce the amount of work we have to do by counting all of the compositions which are just reorderings of each other, compute the function for one of them, and then multiply by the number of reorderings.

erislover:

I’m not quite sure yet how to solve the more general problem using this method. I mean, the n-colors with minimum successes problem. (In my personal use of probability, I normally look for “n or more wins” rather than “exactly n wins.”) Generating all these permutations smells a lot like an edge case to me, rather than a sound general solution.

I had the suspicion that one only needed to extend the partitions to cover the case of the 17 being in each position, and then to use the degeneracy factor to cover the remaining indistinguishability in the counts. That is, specifically, for the partition
(8 + 2 + 2 + 2 + 1 + … + 1)
Which normally has a degeneracy factor of 504, one need only expand this single partition to 9 duplicates, where the 17-pool is associated with each term, and use the degeneracy of the subsequent partitions as a factor, rather than generating all permutations of the terms.

Even if this is sound, it still may be slower than just looping up permutations, because I am not a great programmer, either. Still… the permutation-style calculation has a big weakness when we solve the more general problem, like, “At least 1 of six of the colors,” or some such. The permutation space grows very fast. I mean, it’s essentially going to be an O(n!) average calculation—which is horrible when O(n!) should be worst-case (because we really do have to look at almost all permutations).

I’m not sure what “more general problem” you’re talking about here–is this the 819+117 problem you describe in your OP, or something else?

About the “edge case” comment—if you mean that the use of unordered partitions and degeneracy factors only works with all 9 colors having the same number of balls, then this is true. But your idea for solving the 819+117 problem will work. Referring back to my description of the source of these “degeneracy factors” you can see that in the second problem it’s no longer the case that any reordering of a composition gives the same value; however, any reordering among the first 8 terms (representing the 8 colors with 19 balls) still does give the same value. So now you’d have to consider something in between an (unordered) partition and an (ordered) composition, where the final term is distinguished from the first 8. So for example the line


 252  (1, 1, 1, 1, 1, 1, 2, 2, 9)

in the above table must be expanded to


  28  (1, 1, 1, 1, 1, 1, 2, 2; 9)
  56  (1, 1, 1, 1, 1, 1, 2, 9; 2)
 168  (1, 1, 1, 1, 1, 2, 2, 9; 1)

computing the “degeneracies” in the same way as before, where now you are only considering reorderings of the first 8 terms. (The resulting table now has 136 “partially-ordered” partitions instead of just 41, still an improvement on the 43758 compositions).

For the problem sizes you gave here (919 and 819+1*17) a computer can find all of the compositions and compute the answer in a few seconds. But you are right that as the problem size grows the number of compositions grows very rapidly, so this brute-force solution is not all you could wish for. If there’s enough symmetry in your problem, you can reduce the number of terms with approaches like the above (here, by factors of something like 9! and 8!, at best).

The number of (unordered) partitions still grows pretty rapidly, though. I’ll end this post for length; in the next post I’ll describe another approach which is more computationally feasible in the symmetric case (C colors, B balls of each color), based on Inclusion-Exclusion. (Like the approach above, it gets more complicated when you have fewer symmetries. I don’t know which approach, if either, is faster for the general asymmetric problem.)

Omphaloskeptic · May 31, 2011, 4:19pm

The idea of the inclusion-exclusion counting principle is that it may be easier to double-count some values and then compensate afterward by subtracting off the values you counted twice. For a very simple example, consider your problem but with only c=3 colors (Red, Green, Blue) and b=5 balls of each color; you draw d=4 balls and want to know the probability of getting at least one of each color. We will count the number of ways that we don’t draw all colors, then subtract this from the total number C(b*c,d)=C(15,4)=1365 of ways of drawing our four balls from the fifteen.

We can start by counting the number of ways that we can wind up with only two colors of balls. We can end up with only red and green balls in C(10,4)=210 ways; similarly there are 210 ways to draw only red and blue, and 210 to draw only green and blue. Likewise there are C(5,4)=5 ways of drawing just red balls, 5 ways of drawing just green balls, and 5 ways of drawing just blue balls. Adding these all up gives


RG    RB    GB    R   G   B
210 + 210 + 210 + 5 + 5 + 5 = 645

But we’ve overcounted; among the 210 ways of drawing only red and green balls we’ve also included the 5 ways of drawing only red balls and the 5 ways of drawing only green balls. So each of the three “210” terms includes two extra “5” terms, which we must exclude from the sum:


(RG-R-G)    (RB-R-B)    (GB-G-B)    R   G   B
(210-5-5) + (210-5-5) + (210-5-5) + 5 + 5 + 5 = 615

and there are 615 ways that we can end up without at least one of each color. So there are 1365-615=750 ways of ending up with at least one of each color. (You can verify this directly using the partition approach; there’s just one partition, (1,1,2), of degeneracy 3, so you have 3C(5,1)C(5,1)C(5,2)=35510=750 ways of drawing at least one of each color.)
Now for the general symmetric problem (c colors, b of each color, draw d (d>=c)): Let’s define X(n,m) as the number of ways of drawing from a pool of n colors (b of each color, so we are drawing from a pool of bn balls) in such a way that we draw exactly m different colors. (So we are interested in X(c,c).) To compute X(n,m), begin by counting all of the ways of choosing m of the n colors (C(n,m) ways) and then selecting d balls from this pool of mb balls (C(mb,d) ways); this is
C(n,m)C(mb,d) .
But this is an overcounting, since it counts each result in which we end up with fewer than m colors more than once, where we want it to count not at all. So let’s figure out how many times these unwanted results are counted, and subtract them out. A draw containing a particular set of k different colors (0<k<m) will be counted in each of the "C(mb,d)" terms for which the m chosen colors contain these k. So for each of the “C(m*b,d)” terms we have to subtract off the number of draws from a pool of m colors for which exactly k different colors are drawn—that is, X(m,k)—for each k in 0<k<m. This gives us a recursion relation for X:

X(n,m) = C(n,m) * [ C(m*b,d) - ( X(m,1) + … + X(m,m-1) ) ] .

Why is this any better than the partition approach? The number of partitions of d, even restricted to exact length c, grows pretty quickly with c and d (in particular, I’m pretty sure it’s faster-than-polynomial growth); but evaluating X(c,c) involves only computation of values of X(n,m) with m<=n<=c, so there are at most ~c[sup]2[/sup]/2 such distinct values; memoization of the intermediates makes this a quick polynomial-time computation.
So this is a prettier counting method in the symmetric case at least. If we wanted to handle non-symmetric versions (not all colors having the same number of balls) then we’d have to replace the product C(n,m) * [ … ] with a separate treatment of each choice of m colors appearing different numbers of times. (So in your second case, for example, you’d have one case in which the m colors included the 17-ball color, which would occur C(n-1,m-1) times; and one in which it didn’t, which would have a multiplier C(n-1,m). The [ … ] also has to keep track of which colors are allowed.) This gets quite a bit more complicated, and I haven’t tried to work it out to see if it’s still an improvement over the composition formulation.

Topic		Replies	Views
Probability (n balls and n containers) Factual Questions	40	2289	August 14, 2004
Probability Question (binomial distribution+) Factual Questions	12	1902	May 7, 2012
Dope seeks simple math answer... Factual Questions	14	2078	October 11, 2016
Yet another probability/combinatorics question involving dice... Factual Questions	12	2317	February 18, 2009
Another 2nd grade math question: What would this be called? Factual Questions	18	1499	January 15, 2005

Multivariate hypergeometric distribution

Related topics