Let’s say that I put slips of paper numbered one through 100 in a jar, and draw them out two at a time. Of each pair of numbers, one will be larger than the other; let’s say that I toss out whichever is the smaller number. After I’ve done this fifty times I now have fifty slips of paper left. What can be said about the (probable) distribution of the numbers? On the one hand each drawing eliminated the smaller of the two numbers; but on the other hand there was no control over which numbers were paired. 99 could have been eliminated in favor of 100, and 2 could have been kept by eliminating 1. So have the numbers been skewed towards larger values or not?

They’re certainly skewed towards larger numbers. If nothing else, 1 is sure to have been thrown out, and 100 is sure to have been kept. And while it’s possible that 2 was kept or 99 thrown out, it’s very unlikely, and so on to diminishing degrees for numbers proceeding toward the middle values.

Just *how* skewed they would be, I don’t know off the top of my head. If I had to guess, I’d say that the new average would be 63.2 .

I assume you mean with no replacement?

The probability that each number remains equals the probability that it’s the higher number of its pair. The probabilities aren’t completely independent, but I think will tend toward independence as the size of the set increases. 100 is large enough that they are nearly independent, I think.

100 - probability that it remains = 99/99 (all possible partners are lower)

99 - 98/99 (98 of the possible partners are lower)

98 - 97/99

97 - 96/99

.

.

2 - 1/99

1 - 0/99

So the probability that a number remains is proportional to its value.

The probabilities that given numbers remain is far from independent. If 99 is thrown out, it must have been paired with 100. In that case it is certain that 98 remains because it can’t be paired with either of them.

Obviously not *completely* independent, as I said. But you say “far from” independent without any justification - your example doesn’t quantify *how* far from independence. All you’ve given is a case where a p=1/99 event changes the probability for the number 98 by a tiny amount from p=97/99 to p=1.

It seems to me that it must tend toward independence as the size of the set increases, and my intuition is that it’s pretty close to independence for a set of 100.

I wrote a quick program to check this. Over 1000 trials, the average was 67.273.

Over 100,000 trials, the average was 67.331.

For 100 numbers, the mean value works out to 67+1/3.

The general formula for n (where n is even, of course), is:

[ sum(i=2 to n) [n*(n-1)] ] / [(n-1)*(n/2)]

For example, for 100:

```
[2*(1/99)+3*(2/99)+4*(3/99)+.....+100*(99/99)] / 50
```

One way of thinking about this is that the summation a type of expected value for this sum and you divide by the 50 to account for a multiplicity.

Haven’t worked out the limiting behavior, but it appears the mean trends toward 2/3 of the maximum value. E.g. for 1000 it’s 667+1/3, for 10000 it’s 6667+1/3, for 100000 it’s 66667+1/3, and so on.

The analytic result assuming the probabilities I stated above is 67 1/3.

(ninjaed)

Forgot to note, you can go ahead and use independence. The drawing of an individual pair after the first pair has been drawn won’t be independent, but that’s not the right trial to look at.

The probabilities you state are for a ‘trial’ where a single sample is the game in its entirety, i.e. all 50 drawings of pairs as a set is a single ‘trial’. In this case, separate trials are independent and the probabilities stated are correct.

You may assume independence for calculating the mean, but you may not do so to find the joint probability distribution of the 100 0-1 variables denoting included or not.

Not sure why this is important.

It’s not necessary if we just want the overall probability distribution for the resulting 50 balls, which **Riemann** gave. Those are the probabilities a given ball remains at all, which was the question posed in the OP. We don’t need to delve into cases where a number is included or not, and actually this can lead to a poorer intuitive understanding of the problem. The probability distribution was given directly.

One way to think of the distribution of remaining numbers is the permtuations of 100 numbers taken 50 at a time, i.e. P(100,50). We do need to divide this by 2^50, because drawing 1-2 is the same as 2-1 and there are 50 pairs. So we have P(100,50)/2^50 possible unique drawings of pairs out of 100 numbers. Of those, we already have the probability that ‘2’ shows up is 1/99 of these. Likewise, of the P(100,50)/2^50, ‘3’ shows up 2/99 of those, ‘4’ shows up 3/99 of those, etc.

If it makes it easier, we can reduce this problem. For 4 numbers - 1,2,3,4, we can actually just list all possible outcomes up to symmetry:

[(1,2)(3,4)] [(1,3)(2,4)] [(1,4)(2,3)]

the results for each pair, respectively are [2,4] [3,4] [3,4].

By our analysis above, there should be P(4,2)/2^2 = 3 outcomes, which checks out. ‘2’ remains 1/3 of the time. ‘3’ remains 2/3 of the time. ‘4’ remains 3/3 of the time.

We can extend to 6 numbers - 1,2,3,4,5,6 and list all 15 possible outcomes, which again matches P(6,3)/2^3:

[(1,2)(3,4)(5,6)][(1,2)(3,5)(4,6)][(1,2)(3,6)(4,5)][(1,3)(2,4)(5,6)][(1,3)(2,5)(4,6)]

[(1,3)(2,6)(4,5)][(1,4)(2,3)(5,6)][(1,4)(2,5)(3,6)][(1,4)(2,6)(3,5)][(1,5)(2,3)(4,6)]

[(1,5)(2,4)(3,6)][(1,5)(2,6)(3,4)][(1,6)(2,3)(4,5)][(1,6)(2,4)(3,5)][(1,6)(2,5)(3,4)]

The corresponding outcomes:

[2,4,6][2,5,6][2,5,6][3,4,6][3,5,6]

[3,5,6][3,4,6][4,5,6][4,5,6][3,5,6]

[4,5,6][4,5,6][3,5,6][4,5,6][4,5,6]

Counting these, ‘2’ remains in 3/15 or 1/5 of all possible drawings, ‘3’ remains 6/15 or 2/5 of all possible drawings, ‘4’ remains 3/5 of all possible drawings, ‘5’ remains in 4/5 of all possible drawings, and ‘6’ remains in 5/5 of all possible drawings.

The same analysis holds for n=8 or n=10 or any even n.

The pattern is clear for any ‘n’. ‘2’ remains 1/(n-1) times, ‘3’ remains 2/(n-1) times, ‘4’ remains 3/(n-1) times, etc.

Every possible outcome is listed for these examples, so we don’t have to worry about individual trials, and these are indeed the overall probabilities. We don’t have to worry about joint probabilities because, as I have noted, that’s within a single ‘trial’ and not really relevant for the overall problem. If we consider all 50 pair drawings as a single ‘trial’, we have our answer.

Don’t even worry about limiting behavior, because that always reduces to 2(n+1)/3 . Or, where it’s m to n, instead of starting with 1, it’s 2(n+m)/3.

EDIT: Oh, and for anyone worried about “non-independence”, you’d expect that to be most significant in the small n cases. So manually work it out for 2 and 4, both of which are easy, to double-check. For n=2, you always have just the 2 slip left. And for n=4, you always have the 4 slip, and either the 2 slip (if it was paired with 1, a 1/3 chance) or the 3 slip (in the other 2/3 chances).

Or rather, the mean is always exactly ⅔ of (max_value + min_value), e.g. ⅔ * (100+1) in this example.

This is a key point. The *first* pairing is obviously independent of the others … and the pairings can be ordered such that any pairing becomes the first pairing! (What’s the best way to express and establish this property eloquently?)

Ah, indeed so.

Where were you when I was switching between indexing-by-0 and indexing-by-1?

And once again, I see my decision to major in English was fortuitous.

I’ll take a large fries, please.

It’s interesting I think, that when I read the question, my thought was that the answer would probably be around two thirds. My maths is not up to doing the calculation but now I see that others have admirably demonstrated how to do it and that my instinct was correct.

And my (incorrect) instinct was 1-1/e. I apparently spend too much time thinking about calculus, and not enough about geometry: It’s basically just the centroid of a triangle problem.

Triangle? I can’t handle these complex multi-dimensional objects. Fortunately, this is a simple problem that takes place on a 1-dimensional line segment:

**Choosing n uniform random points on the [0,1] line segment, expect to find the k’th smallest at location k/(n+1).**

OP’s is a variation on the

*largest of two problem*, so k = n = 2 and

*k/(n+1) = ⅔*.

A sort of geometrical way of way of looking at it gives the result that the expected value is twice the ratio of the (n-1)th tetrahedral number to the (n-1)th triangular number.

Make an n by n grid of squares with position in the x direction representing the lower number of a pair and the y position representing the larger number of a pair. The triangle of squares to the upper left of the north-east diagonal represents the possible pair selections. The number of squares in this area is n*(n-1)/2, which is the (n-1)th triangular number, since they from a triangle with n-1 squares on a side. In an exhaustive list of every permutation of selecting n/2 pairs, each pair represented by one of these n*(n-1)/2 squares will appear an equal number of times, and therefore only need to average over the squares to find expected value.

Row number m has m-1 squares of possible pairs that represent m being the higher number of the pair. Imagine stacking m unit cubes on each of the squares in row m. The m*(m-1) squares in row m may be arranged into a two triangles of side (m-1), because m*(m-1)/2 is the (m-1)th triangular number. The list of possible pairs allows m to be any of the m-1 integers from 2 through m. The triangles of side 1 through n-1 can be stacked into two tetrahedrons of side n-1, and the total number of cubes is therefore twice the (n-1)th tetrahedral number, or 2*(n-1)*n*(n+1)/6.

The resulting expected value is number of cubes, 2*(n-1)*n*(n+1)/6, divided by the number of pairs in the array, n*(n-1)/2, giving 2*(n+1)/3.