I am trying to work out the expected frequency distribution in the following scenario:
Start by randomly picking a whole number between 1 & 10, inclusive. This number will be known as the “comparison” number, and this number will never change for the remainder of the exercise.
Now, keep randomly picking a whole number between 1 & 10 until the number you’ve picked matches the comparison number. Keep track of how many “attempts” it takes for this newly randomly selected number to match the comparison number. Did it take three attempts? Fifteen attempts? Did it match on the first attempt? Do this over and over, noting how many attempts it took each time for the newly selected number to match the comparison number. Each time there’s a match, populate the result (i.e., the number of attempts it took to get a match) into a frequency distribution table.
My question is, if you made billions and billions of attempts to match with the comparison number, and you plotted all the number of attempts it took on a distribution chart, what would the frequency distribution look like?
I’ve written a computer program to simulate the scenario until ten million matches get made, but I get (seemingly) weird results each time I run it. E.g., 3, 8, 9 & 12 attempts always come out as the most common results. I was expecting a fairly even distribution between 1 & 10 attempts, gradually tailing off from 11 attempts and higher. But that’s not what happens. Instead, I see no consistency or pattern in the frequency distribution.
There’s either something wrong with my computer program, something wrong with my random number generator, or the results are correct, but strange.
I’m curious to know what the actual frequency distribution should look like for this scenario, perhaps from someone who knows how to crunch the math, as opposed to me who only knows how to write potentially buggy computer programs. Thanks.
(Below is a typical frequency distribution table generated by my computer program: number of attempts before getting a match to the comparison number, iterating until 10 million matches to the comparison number are made.)
At each attempt, you have a 1/10 chance of matching the comparison number. This is therefore a geometric distribution with p = 1/10; and the probability that your first success is on attempt #n is (1 - p)^(n-1) * p = 9^(n-1) / 10^n.
In particular, the probability of needing n attempts before a success should always be decreasing with n, but your results don’t really show that. I would agree there’s probably something wrong with your software.
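To see what the table ought to look like, you can tabulate those geometric probabilities directly. A quick sketch (in Python, since that’s what later posts in the thread use):

```python
# Geometric distribution with p = 1/10:
# P(first match on attempt n) = (9/10)**(n-1) * (1/10)
p = 1 / 10
probs = [(1 - p) ** (n - 1) * p for n in range(1, 21)]
for n, pr in enumerate(probs, start=1):
    print(f"{n:3d}  {pr:.6f}")
```

Each probability is 90% of the one before it, so the curve should decay smoothly right from attempt 1; there is no flat region over the first ten attempts.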
Thanks markn+, that’s the nice smooth distribution I was expecting to see, although I had assumed it would be even across the first ten attempts.
Perhaps something not quite right with my random number generator.
As far as I can tell, PHP (7.1.0 and later) uses a Mersenne Twister RNG, so it ought to produce pretty good quality random numbers. I’d look at your own code first. What happens if you just translate my Perl code to PHP?
EDIT
Actually, I should say it worked a lot better. The frequency distribution starts off smooth but then gets “lumpy”. Either way, I’ll read up on how these modern RNGs work. Cheers.
My guess is that you were seeding it with the time, and re-seeding it for every trial, except that the trials were happening so quickly that you got many different trials with the same seed.
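To illustrate that failure mode (a sketch of the pitfall, not necessarily what your PHP code does): if you re-seed with the current whole-second time inside a fast loop, the generator restarts from the same state, so consecutive trials produce identical “random” draws:

```python
import random
import time

seed = int(time.time())  # whole-second resolution

# Two "trials" that each re-seed: the loop runs far faster than
# the clock ticks, so both get the same seed and the same draws.
random.seed(seed)
first = [random.randint(1, 10) for _ in range(5)]
random.seed(seed)
second = [random.randint(1, 10) for _ in range(5)]
print(first == second)  # prints True
```

Seeding once at the top of the program avoids this entirely.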
I thought the question was interesting enough that I decided I’d take a stab at coding a simulation in Python 3.7, whose PRNG is also the Mersenne Twister. Here are my (poorly formatted) results. After ten million iterations, the four columns below represent: the number of trials until success, the number of iterations in which success occurred in exactly the given number of trials, the observed frequency, and the expected frequency, per the observation by MikeS given in post 2 that this is a geometric distribution.
import random

comp = random.randint(1, 10)          # the fixed comparison number
L = [0] * 150                         # L[i] counts matches that took i+1 attempts
for t in range(10000000):
    i = 0
    n = -1
    while n != comp and i < 150:
        n = random.randint(1, 10)
        if n == comp:
            L[i] += 1                 # success on attempt i+1
        else:
            i += 1
for m in range(0, 150):
    op = L[m] / 10000000              # observed frequency
    ep = (9 ** m) / (10 ** (m + 1))   # expected geometric probability
    print(m + 1, L[m], op, ep)
May I ask a few follow-up questions? It is my understanding that PHP, Perl, and Python are all interpreted languages. Is there a speed advantage of any of these languages over the others, all else being equal? I understand the answer may be processor dependent. My code ran on my decidedly middle-of-the-road processor in about six minutes, which feels glacial. If my code could be optimized, I would appreciate any insight. Would others be willing to share how long it takes their code to run?
When were you seeding the RNG? For what you describe, you should only seed it once at the beginning of the run, not for each trial to find the next match. Reseeding a RNG inappropriately can mess up the randomness.
That does seem slow. My Perl version takes about 15 seconds on a Linux VM running on a Windows host on a 3.4 GHz Xeon CPU. Your final loop does a lot more than mine does, but it’s only iterating 150 times, so I doubt that’s the issue. I don’t know much about Python – does range(10000000) actually create a list of 10 million numbers in memory? If so, that’s probably the issue – change it to an incrementing for loop.
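If you only care about the attempt counts and not the individual draws, there’s also a math shortcut that removes the inner loop entirely (a sketch, assuming a standard inverse-transform sample of the geometric distribution): with U uniform on [0, 1), n = ceil(ln(1 - U) / ln(0.9)) is geometric with p = 1/10, so each match costs one random number instead of ten on average:

```python
import math
import random

def geometric_attempts(p=0.1):
    """Sample the number of attempts until first success in O(1),
    via the inverse CDF of the geometric distribution."""
    u = random.random()  # uniform on [0, 1)
    return max(1, math.ceil(math.log(1 - u) / math.log(1 - p)))

random.seed(1)
counts = [geometric_attempts() for _ in range(100000)]
print(sum(counts) / len(counts))  # should land close to the theoretical mean of 10
```

The same trick works in Perl or PHP, since it only needs a uniform random number and a logarithm.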