How would I calculate the probability of a random 6 digit number repeating?

I deal with a few apps whose purpose is to generate a random 6 digit number every 30 seconds. That got me wondering about the likelihood of ever seeing any one of those numbers again with at least 50% probability. Here is the specific problem, and the question regarding it.

A random 6 digit number is generated.
A 2nd random 6 digit number is generated.
We keep generating random 6 digit numbers.

How many random 6 digit numbers do we have to generate until we reach a point where there is a 50% probability that any two of the previous numbers will match? Intuitively it would seem like the answer would be half of them, or 500,000, but I think that’s just if I want to know if the most recently generated number will match any of the prior numbers. The question, however, is not whether just the most recent number matches, but if any of the previous numbers match. Is it actually 500,000, or is it some smaller number, and if so, how would we set up the calculation to find that number?

This is basically a numerically larger version of the birthday paradox that says you only need to have 23 people chosen at random so the the probability that two of them share a birthday is more than one half.

The basic idea is suppose there are no matches in the first n numbers, what’s the probability there is no match with the n+1st. Set N = the number of possible numbers. In your case N = 1000000. The probability that the n+1st doesn’t match conditional on the first n not matching is (N-n)/N = 1-n/N because there are N-n numbers not already picked and there are N possibilities.

So the probability of no matches in the first n+1 is 1*[(N-1)/N]{(N-2)/N]…[(N-n)/N] = N!/[N^n (N-n)!] You want to find n so this probability is smaller than 1/2. (Smaller because recall we’re finding the probability of no match here).

This should be (assuming I did Stirling right) 1 - exp(-n^2/2N) = 1178,but I’m running late and didn’t check.

Yep, like OldGuy said, it’s a generalized version of the classic birthday problem. Using the n(p; d) approximation formula given in this section of that Wikipedia article, I get ≈ 1177 numbers generated before there is a 50% chance of a duplicate.

It’s a lot smaller than you think.

As a much cruder approximation than what OldGuy and TheGunIsMightierThanThePen gave, we know the number of pairs in a set of N numbers scales as N2 (it’s actually N(N-1)/2)). We can guess that when the number of pairs equals the total number of possibilities, the odds are pretty good that we’ll start seeing dupes at that point. So if there are 1000000 total numbers, we’ll expect dupes at around N=sqrt(1000000)=1000 (not far from the given 1177).

This has implications for computer algorithms and security. Hashing is a process where you generate a number based on some input text, and a collision is when two different data inputs hash to the same value. Collisions are bad, and happen more frequently than one might expect. A 20-digit hash value might seem to never have a collision. But if you have a just few billion entries (not that much), it becomes very likely.

Just to add some intuition to the numerical answers above…

This would be the solution only if you specified some target number to match beforehand. For example: “how many random numbers do I have to generate until there’s a 50% chance I come up with an instance of 999999?”

But you haven’t specified any target that you are matching to, so any pair can match.
After 2 numbers, there is obviously just 1 possible pair to match.
After 3 numbers, you can match (1 with 2), (1 with 3), (2 with 3) = 3 possible matches.
After 4 numbers, (1-2),(1-3),(1-4),(2-3),(2-4),(3-4) = 6 possible matches
.
.
After 10 numbers, 45 possible pairs to match
.
.
After 100 numbers, 4950 possible pairs to match
.
.
After 1000 numbers, 499500 possible pairs to match
.
.

The number of possible pairings grows very rapidly as the sample size increases, that’s why you reach a 50% probability of match more quickly than you might expect. And you can see in the precise answers above that you reach 50% probability in the vicinity of 1000 numbers, which is around the point at which there are about 500,000 possible pairings between the 1000 numbers.

When looking at patterns in “random” numbers it starts to become very important to examine how those numbers are generated. Computers are not very good at random.

Computers can be good at random, if you go to a lot of effort to make them random. And in any event, they’re a lot better at random than humans are.

These particular numbers are generated by apps as part of a multi-factor authentication for prescribing controlled substances. The app generates a random 6 digit number every 30 seconds, which must be entered in the electronic prescribing program to authenticate. that I’m the one sending in the prescription. I have no idea how the app actually generates the number.

That is obviously not a random number. If it were, the program would not be able to authenticate it.

Sounds like a straightforward time-based HMAC-based one-time password.

Pseudo random numbers, as generated by most algorithms repeat more often than you might think.
Listening to “shuffled” Playlists or “random” sllideshows gives the impression that whatever collection is being sampled is far smaller than it really is. Most random number generators will repeat some numbers more often than others. Using an algorithm that chooses Without Replacing will make your Playlist or sideshow appear much larger than the usual pick and replace algorithm.

Pseudorandom numbers may not be random, but they are supposed to be equally distributed, like random numbers. What makes your playlist appear longer or shorter is the difference between a (pseudo)random permutation and (pseudo)randomly selecting songs as in the coupon collector, which takes on average about n log n + γn + 1/2 to play all the songs, so it seems like your playlist is several times longer.