How would I calculate the probability of a random 6 digit number repeating?

FlikTheBlue · May 14, 2021, 10:26pm

I deal with a few apps whose purpose is to generate a random 6 digit number every 30 seconds. That got me wondering about the likelihood of ever seeing any one of those numbers again with at least 50% probability. Here is the specific problem, and the question regarding it.

A random 6 digit number is generated.
A 2nd random 6 digit number is generated.
We keep generating random 6 digit numbers.

How many random 6 digit numbers do we have to generate until we reach a point where there is a 50% probability that any two of the previous numbers will match? Intuitively it would seem like the answer would be half of them, or 500,000, but I think that’s just if I want to know if the most recently generated number will match any of the prior numbers. The question, however, is not whether just the most recent number matches, but if any of the previous numbers match. Is it actually 500,000, or is it some smaller number, and if so, how would we set up the calculation to find that number?

OldGuy · May 14, 2021, 10:59pm

This is basically a numerically larger version of the birthday paradox that says you only need to have 23 people chosen at random so the the probability that two of them share a birthday is more than one half.

The basic idea is suppose there are no matches in the first n numbers, what’s the probability there is no match with the n+1st. Set N = the number of possible numbers. In your case N = 1000000. The probability that the n+1st doesn’t match conditional on the first n not matching is (N-n)/N = 1-n/N because there are N-n numbers not already picked and there are N possibilities.

So the probability of no matches in the first n+1 is 1*[(N-1)/N]{(N-2)/N]…[(N-n)/N] = N!/[N^n (N-n)!] You want to find n so this probability is smaller than 1/2. (Smaller because recall we’re finding the probability of no match here).

This should be (assuming I did Stirling right) 1 - exp(-n^2/2N) = 1178,but I’m running late and didn’t check.

TheGunIsMightierThanThePen · May 14, 2021, 11:04pm

Yep, like OldGuy said, it’s a generalized version of the classic birthday problem. Using the n(p; d) approximation formula given in this section of that Wikipedia article, I get ≈ 1177 numbers generated before there is a 50% chance of a duplicate.

It’s a lot smaller than you think.

Dr.Strangelove · May 14, 2021, 11:12pm

As a much cruder approximation than what OldGuy and TheGunIsMightierThanThePen gave, we know the number of pairs in a set of N numbers scales as N² (it’s actually N(N-1)/2)). We can guess that when the number of pairs equals the total number of possibilities, the odds are pretty good that we’ll start seeing dupes at that point. So if there are 1000000 total numbers, we’ll expect dupes at around N=sqrt(1000000)=1000 (not far from the given 1177).

This has implications for computer algorithms and security. Hashing is a process where you generate a number based on some input text, and a collision is when two different data inputs hash to the same value. Collisions are bad, and happen more frequently than one might expect. A 20-digit hash value might seem to never have a collision. But if you have a just few billion entries (not that much), it becomes very likely.

Riemann · May 14, 2021, 11:28pm

Just to add some intuition to the numerical answers above…

This would be the solution only if you specified some target number to match beforehand. For example: “how many random numbers do I have to generate until there’s a 50% chance I come up with an instance of 999999?”

But you haven’t specified any target that you are matching to, so any pair can match.
After 2 numbers, there is obviously just 1 possible pair to match.
After 3 numbers, you can match (1 with 2), (1 with 3), (2 with 3) = 3 possible matches.
After 4 numbers, (1-2),(1-3),(1-4),(2-3),(2-4),(3-4) = 6 possible matches
.
.
After 10 numbers, 45 possible pairs to match
.
.
After 100 numbers, 4950 possible pairs to match
.
.
After 1000 numbers, 499500 possible pairs to match
.
.

The number of possible pairings grows very rapidly as the sample size increases, that’s why you reach a 50% probability of match more quickly than you might expect. And you can see in the precise answers above that you reach 50% probability in the vicinity of 1000 numbers, which is around the point at which there are about 500,000 possible pairings between the 1000 numbers.

The_Librarian · May 20, 2021, 9:27am

When looking at patterns in “random” numbers it starts to become very important to examine how those numbers are generated. Computers are not very good at random.

Chronos · May 20, 2021, 11:02am

Computers can be good at random, if you go to a lot of effort to make them random. And in any event, they’re a lot better at random than humans are.

FlikTheBlue · May 20, 2021, 4:03pm

These particular numbers are generated by apps as part of a multi-factor authentication for prescribing controlled substances. The app generates a random 6 digit number every 30 seconds, which must be entered in the electronic prescribing program to authenticate. that I’m the one sending in the prescription. I have no idea how the app actually generates the number.

DPRK · May 20, 2021, 4:12pm

That is obviously not a random number. If it were, the program would not be able to authenticate it.

Sounds like a straightforward time-based HMAC-based one-time password.

Squink · May 20, 2021, 8:06pm

Pseudo random numbers, as generated by most algorithms repeat more often than you might think.
Listening to “shuffled” Playlists or “random” sllideshows gives the impression that whatever collection is being sampled is far smaller than it really is. Most random number generators will repeat some numbers more often than others. Using an algorithm that chooses Without Replacing will make your Playlist or sideshow appear much larger than the usual pick and replace algorithm.

DPRK · May 21, 2021, 4:04am

Pseudorandom numbers may not be random, but they are supposed to be equally distributed, like random numbers. What makes your playlist appear longer or shorter is the difference between a (pseudo)random permutation and (pseudo)randomly selecting songs as in the coupon collector, which takes on average about n log n + γn + 1/2 to play all the songs, so it seems like your playlist is several times longer.

Topic		Replies	Views
Several probabilities questions Factual Questions	7	1814	November 27, 2012
Statistics question -- matching phone numbers Factual Questions	15	1569	October 16, 2007
Odds question Factual Questions	4	501	December 10, 2018
Probability/argument with g/f Factual Questions	32	1501	October 9, 2003
Birthday odds Factual Questions	4	676	December 15, 2000

How would I calculate the probability of a random 6 digit number repeating?

Related topics