Probability calculation question

I have a population x, and I want to select y items from it. The items in population x are reusable (selection with replacement), similar to the “Birthday Paradox.” How do I calculate the expected number of unique items?

Your question is unclear. Are you asking about ‘selection’ with replacement, and what the odds are of getting a previously selected item?

Stranger

Well, as an example, I have 100 items, and I select them one at a time, put each back, and reselect. If I do that 100 times, I’d expect to end up with maybe 60-65 unique items. I’m wondering what the general formula is.

What you are describing is a cascading chain of conditional probabilities, from which you could laboriously derive a composite probability distribution stretching between the two extremes: 100 unique items at one end, and the exact same item drawn 100 times in a row at the other. Those two cases are the easiest to solve because they are linear chains; everything in between branches. You could develop a formula for the likelihood of getting a given number of unique items in y draws from a population of x by summing over all of the possible permutations, but it won’t be a simple formula even if you go through the gyrations to reduce it down.

From a practical standpoint, unless you want the academic exercise of working through a graduate-level problem in combinatorics, I would just set up a Monte Carlo simulation, run it a few million times, and trust the central limit theorem to give you a close approximation of that distribution.
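A minimal Monte Carlo sketch of that idea in Python (the function name and trial count are arbitrary choices of mine, not anything from the thread):

```python
import random

def unique_count(x, y, trials=20_000):
    """Estimate the expected number of unique items seen when
    drawing y times with replacement from a population of size x."""
    total = 0
    for _ in range(trials):
        # One trial: draw y items with replacement, count the distinct ones.
        total += len({random.randrange(x) for _ in range(y)})
    return total / trials

# For x = y = 100 this lands near 63.4, inside the 60-65 range guessed above.
print(unique_count(100, 100))
```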

Stranger

The limit is 1 - 1/e (roughly 63.2%).

Indeed so. It can be re-framed to be equivalent to something called a Derangement in Combinatorics (a term I always appreciated).

And you tend to hit it pretty fast. After about 7 or 8 trials, you’re probably in the neighborhood of that limit already.

If all you are interested in is the expected value, it’s not actually too hard.
For a given item i of the population, let’s define an indicator variable that is equal to 1 if that item is seen and 0 if it isn’t. The total number of unique items seen is equal to the sum of these variables.

And for each of these variables, the probability that the variable has a value of 1 is given by:

P(i will be seen)
= 1 - P(i will not be seen in any of the y draws)
= 1 - P(i will not be seen in a single draw)^y
= 1 - (1-1/x)^y

Now as @Stranger_On_A_Train says, all of these variables are dependent on one another. But if all we are interested in is the expected value, then it doesn’t matter. Regardless of the dependency, the expectation of a sum of variables is equal to the sum of their expectations (linearity of expectation).

So since each of the x items of the population has an expectation equal to 1-(1-1/x)^y, the total number of unique hits will be equal to x*(1-(1-1/x)^y).

The 1 - 1/e limit applies only when y is approximately equal to x, since (1-1/x)^x approaches 1/e as x grows.

In general, if x and y are large, the proportion hit will be close to 1-exp(-y/x).
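A quick numerical check of both expressions, using the x = y = 100 example from earlier in the thread:

```python
import math

x, y = 100, 100
exact = x * (1 - (1 - 1 / x) ** y)    # x*(1-(1-1/x)^y), the exact expectation
approx = x * (1 - math.exp(-y / x))   # large-x approximation x*(1-exp(-y/x))

print(round(exact, 2))   # 63.4
print(round(approx, 2))  # 63.21
```

The two agree to within about 0.2 items even at x = 100, which is why the approximation is usually good enough.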

Too late to edit; this should read:
the total expected number of unique hits will be equal to x*(1-(1-1/x)^y)

Thanks, and actually with some random (not to start that debate!) testing, that pretty much reflects the results I found.

I should hope so! The numbers should match the theory.

But one problem that can exist with empirical testing is determining whether or not one has enough samples to form a general rule, which I imagine is where you were sitting before the thread.

Is this what you’re looking for?
Poisson distribution. Given N possible events, and Q random picks, What is the probability of “x” hits
(e.g. Q=1m lottery players choose from N=3.9m choices, what are the odds of 2 winners)
The limiting as N approaches infinity …
A = Q/N (A is also the expected value or mean)
P(x) = (e ** -A * A**x) / x!
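That formula, applied to the lottery numbers in the example, sketched in Python (the helper name poisson_pmf is my own, not from any particular library):

```python
import math

def poisson_pmf(k, a):
    """P(X = k) for a Poisson distribution with mean a."""
    return math.exp(-a) * a ** k / math.factorial(k)

Q, N = 1_000_000, 3_900_000
A = Q / N                           # expected number of winners, about 0.2564
print(round(poisson_pmf(2, A), 4))  # probability of exactly 2 winners, about 0.0254
```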

Thanks, but no. For what I was looking for, that looks a little fishy to me. And . . . this is not to open another discussion on the nuances and technicalities on what constitutes ‘random.’ For this situation, it’s random enough!

Specifically, where this came up: at my work, a person in another department needed to do a ‘random’ sampling of 150 pieces of equipment out of a population of 250. I was somewhat familiar with the process they used, and I knew that the software a) often generated a previously used number, and b) expected duplicates, especially with large samples of small groups, and just grabbed another random number if one had already been used. So, in a case of 150 out of 250, it might have to generate 200+ random numbers to get 150 unique ones.

So, I was just curious as to how many unique values to expect if it stopped at 150.

The way to do this is not to repeatedly generate random integers until you get 150 distinct ones.

Generate 250 random real numbers, such as in the interval [0, 1], assigning one to each machine. Then sort by the random numbers and sample the machines with the lowest 150 numbers. You can do this pretty easily in Excel.
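A sketch of that random-key approach in Python (the machine labels are hypothetical; in practice `random.sample(population, 150)` does the same job in one call):

```python
import random

# Hypothetical equipment IDs for a population of 250 machines.
population = [f"machine_{i}" for i in range(250)]

# Assign each machine a random real number in [0, 1), sort by it,
# and take the machines with the 150 lowest keys.
keyed = sorted((random.random(), m) for m in population)
sample = [m for _, m in keyed[:150]]

print(len(sample))       # exactly 150 draws
print(len(set(sample)))  # all 150 are unique by construction
```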

This is a “shuffle” algorithm rather than a “selection with replacement” algorithm.