Math Dopers: A little help?

I’ve been working on a probability set…

Don’t worry, it’s not for school.

Out of a universe of 60,000 names I pulled two groups.

One is 6100 people.

One is 3500 people.

They were pulled from the file non-exclusively. That is, the same people could appear in both of the smaller cohorts.

The maximum number of identical people is (obviously) 3500.

The minimum number of matches is zero.

My question is: how do I find the most likely number of matches without looking at the files in question?

I know there has to be a way to achieve that number…I just can’t remember how. Can someone help?

Say group A has 6100 names and group B has 3500 names.

The probability that a given name from the original 60,000 ends up in group A is

6100/60,000.

The probability that a given name ends up in group B is

3500/60,000.

Therefore, the probability of a given name ending up in both is

(6100*3500)/(60,000^2).

Hence, the expected number of people on both lists is

(6100*3500)/(60,000^2) * 60,000 = (6100*3500)/60,000 = (61*35)/6,

or approximately 335.833333.

Sorry; made a typo at the very end. The expected number is approximately 355.833333
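For anyone who wants to double-check that arithmetic, here's a quick sketch (editor's illustration, not anyone's posted method) that does the same multiplication in exact rational arithmetic:

```python
from fractions import Fraction

universe, group_a, group_b = 60_000, 6_100, 3_500

# Expected overlap = P(in A) * P(in B) * universe size
expected = Fraction(group_a, universe) * Fraction(group_b, universe) * universe

print(expected)         # 2135/6, i.e. (61*35)/6
print(float(expected))  # 355.8333...
```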

Technically, this number would be the “expectation value” of the number of people on both lists, i.e. if you randomly selected the cohorts in the way you prescribe over & over many times, this would be the average number you’d get out. The actual “most likely value” may not be the same as this. As an extreme example of how these two can differ, the first method would give you the result that a person selected at random from the US population is expected to have one breast and one testicle; but since there are slightly more women in the US than men, the “most likely” value would be two breasts, no testicles.

That said, I don’t think (with numbers this large) that the two numbers are likely to vary greatly in this case. I’ll leave the verification of this for someone else to actually calculate, though. :slight_smile:

Let me make sure that I understand the question. Here’s what I think you’re saying. From the 60,000, you choose 6100 without replacement (individuals can occur in the 6100 only once). Then, starting from the original 60,000, you choose 3500 without replacement (meaning that individuals can only occur in the 3500 once; I understand that the 6100 are “put back”, which is the whole point). You’re interested in how many individuals are in both the 6100 and the 3500.

If this is right, then I believe the probabilities are given by a hypergeometric distribution. This is a fairly well-known distribution (not as popular as the normal or the binomial, but in the top ten), so you should be able to find information on it easily. This will let you plot the distribution, determine a standard deviation, compute probabilities of being in a certain interval, and so on, if you care. The expectation is, I believe, what Tyrrell McAllister says it is. In this particular case, since 3500 is much less than 60,000, a binomial distribution would probably give you a reasonable approximation.

Here’s how I figure it:

Say you’ve already picked the 6100 people for group A.
Now, each time you select a name for group B, it has a probability of p = 6100/60000 of being a name that’s already in group A.

One can use the formula for Binomial Probability (which I can come back and explain later if necessary) to find the probability of X matches out of the 3500 people selected for group B. The expected number of matches (as explained by MikeS) is np = (3500)*(6100/60000), which equals the 355.83 that Tyrrell McAllister got.

I’m not totally satisfied with this answer, because it doesn’t take into account the fact that the selection of Group B is done without replacement, but I don’t have time to agonize over it now. So, more knowledgeable Dopers, feel free to attack my answer!

That’s not a problem. Instead of thinking of people being removed from the group of 60,000, think of, say, handing 6100 of them a green card and 3500 of them a red card in such a way that the assignment of green cards is independent of the assignment of red cards. Turning up in both lists then translates to getting both a green and a red card. Therefore you can just multiply probabilities to get the probability that a given individual gets both a green and a red card. This is what I did in my post.
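The green-card/red-card argument is also easy to check empirically. Here's a small Monte Carlo sketch (editor's illustration, not from the thread) that deals the "cards" at random many times and averages the overlap:

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
N, A, B = 60_000, 6_100, 3_500
trials = 200

total = 0
for _ in range(trials):
    green = set(random.sample(range(N), A))  # who gets a green card
    red = set(random.sample(range(N), B))    # who gets a red card
    total += len(green & red)                # people holding both cards

print(total / trials)  # hovers near 6100 * 3500 / 60000 = 355.83
```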

It’s been a while since my statistics classes but it looks like a straightforward binomial to me, here’s my reasoning:

To make it easier to visualize I thought of a box of 60,000 pingpong balls. 6100 of them, picked at random, get a red dot painted on. Those are then dumped back into the box. 3500, again picked at random, get a blue dot painted on.

What is the most likely number of pingpong balls to have both a red and a blue dot?

The only event we’re interested in is a ball with both dots. The probability of any given ball being both red and blue is (6100/60000)*(3500/60000) which works out to 0.0059305555…call that p.

Now we apply the binomial distribution to figure out the chance of 0 hits in 3500 trials, 1 in 3500, etc. all the way up to 3500 hits in 3500 trials. We’ve only got 3500 trials because that’s the maximum number of balls that can have both dots.

P(0 balls have two dots) = (Binomial coefficient for 0 hits in 3500 trials) * (p^0) * ((1-p)^(3500-0))

= 3500!/(0! * 3500!) * p^0 * (1-p)^3500
= 1 * 1 * (1-p)^3500
= (0.99406944444)^3500
= 9.089e-10

(For example)

The math gets icky when you have factorials and exponents that large but Excel shows me the maximum probability (mode) occurring at about 20 hits, p=0.08784681. At about 53 or 54 hits the probability is close to what it’d be for 0 hits again, beyond that it very quickly gets vanishingly small.

Look about right?
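For anyone without Excel handy, this binomial calculation is easy to reproduce. The sketch below (editor's illustration, using this post's value of p) recovers both the 9.089e-10 figure for zero hits and the peak of about 0.0878 at 20 hits; note that later posts in the thread question this choice of p, so treat these as the numbers this particular model gives, not the final answer.

```python
from math import comb

n = 3500
p = (6100 / 60000) * (3500 / 60000)  # this post's p, about 0.00593055...

def pmf(k):
    # Binomial probability of exactly k "double dot" balls in n trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(pmf(0))  # about 9.09e-10, matching the hand calculation above

# The peak sits far below 60 hits, so scanning 0..60 is enough to find the mode.
mode = max(range(61), key=pmf)
print(mode, pmf(mode))  # peak at 20 hits, probability about 0.0878
```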

I like Valgard’s model with the ping-pong balls and the dots. But I don’t think that this is a binomial distribution.

Here’s how I’d compute what is the most likely number of pingpong balls with both a red and a blue dot. Let n be the total number of balls. Let r be the number given a red dot, and let b be the number given a blue dot.

Given an integer m, what is the probability that exactly m balls get both dots? First, we compute the number of ways this can happen:

(# of ways to paint r - m balls out of n just red)

x (# of ways to paint b - m balls out of n - (r - m) just blue)

x (# of ways to paint m balls out of n - (r - m) - (b - m) = n + 2m - r - b balls both red and blue).

Thus the number of assignments of dots that leave exactly m balls with both a red and a blue dot is

C(n, r - m) x C(n - r + m, b - m) x C(n + 2m - r - b, m).

Now, the total number of assignments of dots is C(n, r) x C(n, b). Thus the probability of an assignment leaving exactly m balls with both a red and a blue dot is the ratio of these last two values:

(C(n, r - m) x C(n - r + m, b - m) x C(n + 2m - r - b, m)) / (C(n, r) x C(n, b)).

I don’t have access to any good computational software at the moment, but if someone would like to plug this into Excel with the values n = 60,000, r = 6100, b = 3500, I think it would answer the OP’s question.
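Lacking Excel but having an interpreter handy, one can plug the formula in directly. This sketch (editor's rendering of the formula above, not code from the thread) uses exact rational arithmetic to sidestep the huge factorials and evaluates it near the expected value:

```python
from fractions import Fraction
from math import comb

n, r, b = 60_000, 6_100, 3_500

def prob(m):
    # numerator: number of dot assignments leaving exactly m balls with both dots
    ways = comb(n, r - m) * comb(n - r + m, b - m) * comb(n + 2 * m - r - b, m)
    # denominator: total number of dot assignments
    total = comb(n, r) * comb(n, b)
    return float(Fraction(ways, total))

for m in (354, 355, 356, 357):
    print(m, prob(m))
```

Running it, m = 355 narrowly beats m = 356, each at about 0.023.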

Like I said, it’s a hypergeometric, not a binomial. The hypergeometric is in some ways like the binomial, but it deals with the case where you’re drawing without replacement. I won’t reproduce the equations or the reasoning because it’s standard stuff; most probability texts should cover it, or have a look at http://en.wikipedia.org/wiki/Hypergeometric_distribution.

Guys, I forgot most of this stuff (except the math I need for business statistics) about 10 years ago.

Can you lower the level on those last two answers for a poor old man?

Ignore those posts because they’re irrelevant to what you need to know. The most likely number of people in both groups is 356. It’s possible to calculate the probability that the number of people in both groups is any of the possible answers (from 0 to 3500), but that’s not what you wanted to know. The most likely number is 356, and that’s all you need to know.

I basically second what Wendell Wagner said, if you only care about the average. If you care about the variation around that average, look into the hypergeometric.

I agree with Valgard’s approach, though not his numbers. The point is that the question is not what the average number of matches is, but what the most likely number of matches is. As MikeS suggests, we need to find the mode rather than the mean.

The place where I disagree with Valgard is that I think his calculation of p is wrong. If it were right, the probabilities for all the possible numbers of matches should add up to 1, but if you sum them, the total is much less. The way I look at it, you first “mark” 6,100 people out of the 60,000, and then pick 3,500 people from the 60,000 at random. You then count the number of people you’ve now selected who were marked by the initial selection. In that model, p would be 6100/60000 = 0.10166666…

In bc, the Unix calculator, I defined the three functions:


define f(a) {
 if (a > 1)
  return (a * f(a-1));
 return (1);
}

define c(a,b) {
 return (f(a)/f(b)/f(a-b));
}

define m(p,n,t) {
 return (c(t,n) * p^n * (1-p)^(t-n));
}

Where f is the factorial function, c is the choose function, and m is the chance that you get n successes out of t trials when the probability for one success is p. I ran this with 450 digits of precision, where p = 0.101666666666667 and t = 3500. The function m() hit its maximal value at n = 355 successes, which interestingly enough is quite close to the mean calculated by Tyrrell McAllister at the beginning of this thread.

A sample of the probabilities I got:

n = 0: 1.07473603544798e-161%
n = 300: 0.01436011140540525%
n = 350: 2.13020672765995334%
n = 354: 2.22419553669614445%
n = 355: 2.23071660438062294%
n = 356: 2.23026321614830110%
n = 400: 0.11137275995696081%
n = 450: 4.97431757661914e-6%

And summing up all the probabilities up to n = 450, the probability that 0 <= n <= 450 is 99.999984284%, which implies that we are converging to 100%, which is what we expect.

So, to sum up, if you ran a simulation many times, the average number of successes that you’d get would probably be close to 356, but the most common result would probably be 355 successes. How many is many? Well, the probability for 356 successes is so close to that for 355 that you would need to run it billions of times. If I recall the equation correctly, you’d need to run it at least 1/(0.0223071660438062294 - 0.0223026321614830110)^2 = 48,647,386,373 times to get a statistically significant difference.
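Plugging the two probabilities into that (admittedly half-remembered) sample-size formula takes one line; a sketch, for the record:

```python
# The two peak probabilities quoted above, from the bc run
p355 = 0.0223071660438062294
p356 = 0.0223026321614830110

# 1 / (difference)^2 -- the quoted rule of thumb for distinguishing the two
runs_needed = 1 / (p355 - p356) ** 2
print(f"{runs_needed:,.0f}")  # about 48.6 billion runs
```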

Duh. Yes, Punoqllads is right, I messed up my value of p. It’s true that there’s about a 0.6% chance that a given individual out of the original set of 60,000 is in both group 1 and group 2 (or to put it another way, about 0.6% of that population will be in both groups, for an expected value of just under 360). However, when we get to using the binomial distribution we are NOT looking at the original sample space of 60,000 people; we are looking at the smaller sample space of ONLY those people already in one group.

You can run the calculation two ways - assume that you’re looking at all the members of group 1 (6100 people), and the chance that any given one of 'em is a member of group 2 is 3500/60,000, or look at group 2 (3500 people) and the chance that one of those is a member of group 1 is 6100/60,000.

Then calculate the binomial coefficient and you’re off and running. Regardless of which way you do it you should get the same result.

Hard to believe I used to teach this stuff in college :slight_smile:
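The two ways of running the calculation can be compared quickly. A sketch (editor's illustration): the two binomial models are not the same distribution, but they share the same mean, and in this case the same mode, which is what the OP's question needs.

```python
# Way 1: n = 6100 trials, p = 3500/60000 (is each group-1 member also in group 2?)
# Way 2: n = 3500 trials, p = 6100/60000 (is each group-2 member also in group 1?)

def mean_and_mode(n, p_num, p_den):
    mean = n * p_num / p_den
    mode = (n + 1) * p_num // p_den  # floor((n+1)p), the usual binomial mode
    return mean, mode

way1 = mean_and_mode(6_100, 3_500, 60_000)
way2 = mean_and_mode(3_500, 6_100, 60_000)
print(way1)  # (355.8333..., 355)
print(way2)  # (355.8333..., 355)
```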

Dunno why I woke up thinking about this but I think both previous approaches are still a bit off.

Using p=6100/60,000 and the binomial distribution for 0-3500 successes out of 3500 trials gives the probability of a certain number of “hits” (member of both group 1 and group 2) GIVEN that you are already a member of group 2. It does not give the probability of a certain number of hits for the total population of 60,000 at large.

I have to run off to work but I’ll noodle it out there and post an update later.

That’s because it’s not really a binomial distribution, though it happens to be one that can be approximated by a binomial distribution. The precisely correct formula is the one in Josh_dePlume’s post (which is a simplification of the one in my last post; it’s also more transparently true).

So, Punoqllads, perhaps you could re-run your code changing



define m(p,n,t) {
 return (c(t,n) * p^n * (1-p)^(t-n));
}


to



define m(N,r,b,n) {
 return (c(b, n) * c(N-b, r-n) / c(N, r));
}


where N = 60,000, r = 3500, b = 6100, and n is the number pingpong balls that are both red and blue.

Okay, you’re right. I see your point. I had to make more significant changes to my code (calculating 60000! is slow, whoda thunk), though. Basically, for calculating a! / b! / (a - b)!, I calculate (a * (a-1) * … * (a-b+1)) / b!, or (a * (a-1) * … * (b+1)) / (a - b)!, depending upon whether or not b > a - b. For those following at home, the new code:


define f(a, b) {
 if (a > b)
  return (a * f(a - 1, b));
 return (1);
}

define c(a, b) {
 if (b > a - b)
  return (f(a, b) / f(a - b, 1));
 return(f(a, a - b) / f(b, 1));
}

define m(t, a, b, n) {
 return (c(b, n) * c(t - b, a - n) / c(t, a));
}

So, with t = 60000, a = 3500, and b = 6100 for values of n, using 450 digits of precision for intermediate results:
354: m = 2.290979%
355: m = 2.298467%
356: m = 2.298326%
357: m = 2.290574%

So, it still peaks at n = 355 rather than n = 356, but the difference between n = 355 and n = 356 is far less than the difference between treating the distribution as binomial rather than hypergeometric.
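To put numbers on "far less": the two peak probabilities can be compared directly. The sketch below (editor's illustration; it works in log space via lgamma to dodge the huge factorials) reproduces both figures at n = 355:

```python
from math import exp, lgamma, log

def log_comb(a, b):
    # log of the binomial coefficient C(a, b), via log-gamma
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

# Binomial approximation: 3500 trials, p = 6100/60000, exactly 355 matches
p = 6_100 / 60_000
binom_355 = exp(log_comb(3_500, 355) + 355 * log(p) + 3_145 * log(1 - p))

# Exact hypergeometric: C(6100, 355) * C(53900, 3145) / C(60000, 3500)
hyper_355 = exp(log_comb(6_100, 355) + log_comb(53_900, 3_145) - log_comb(60_000, 3_500))

print(binom_355)  # about 0.022307, the binomial figure from earlier in the thread
print(hyper_355)  # about 0.022985, the hypergeometric figure
```

So the binomial is off by about 0.0007 at the peak, while adjacent hypergeometric values differ from each other by only about 0.0000014.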

Interesting. I was sure it was going to peak at 356, but I guess not. Thanks, Punoqllads.

Well good, that saves me some work :slight_smile: Did have a fascinating time digging up simple approximations to use in place of really huge factorials (Stirling’s Formula) though!