What is the expected frequency distribution in this scenario?

I am trying to work out the expected frequency distribution in the following scenario:

  1. Start by randomly picking a whole number between 1 & 10, inclusive. This number will be known as the “comparison” number, and this number will never change for the remainder of the exercise.

  2. Now, keep randomly picking a whole number between 1 & 10 until the number you’ve picked matches the comparison number. Keep track of how many “attempts” it takes for this newly randomly selected number to match the comparison number. Did it take three attempts? Fifteen attempts? Did it match on the first attempt? Do this over and over, noting how many attempts it took each time for the newly selected number to match the comparison number. Each time there’s a match, populate the result (ie, number of attempts to get a match) in to a frequency distribution table.

My question is, if you made billions and billions of attempts to match with the comparison number, and you plotted all the number of attempts it took on a distribution chart, what would the frequency distribution look like?

I’ve written a computer program to simulate the scenario until ten million matches get made, but I get (seemingly) weird results each time I run it. Eg, 3, 8, 9 & 12 attempts always come out as the most common results. I was expecting a fairly even distribution between 1 & 10 attempts, and then gradually tail off from 11 attempts and higher. But that’s not what happens. Instead, I see no consistency or pattern in the frequency distribution.

There’s either something wrong with my computer program, something wrong with my random number generator, or the results are correct, but strange.

I’m curious to know what the actual frequency distribution should look like for this scenario, perhaps from someone who knows how to crunch the math, as opposed to me who only knows how to write potentially buggy computer programs. Thanks.

(Below is a typical frequency distribution table generated from my computer program).
# attempts before getting a match to the comparison number: (Iterates until 10 million matches to the comparison number are made)

1 25159 (this means that 25,159 times, the comparison number was matched on the first attempt)
2 82196
3 1898314
4 82070
5 178978
6 100721
7 72615
8 780732
9 2679468
10 53693
11 110693
12 2122718
13 29810
14 22747
15 42720
16 101152
17 753525
18 26014
19 33398
20 99338
21 482611
22 11773
23 12266
24 25387
25 8910
26 6934
27 9987
28 10390
29 51070
30 4822
31 5038
32 5223
33 25111
34 2471
35 2463
36 2791
37 4199
38 10775
39 1805
40 1441
41 2812
42 5378
43 792
44 732
45 906
46 1600
47 606
48 546
49 548
50 775
51 299
52 300
53 279
54 425
55 238
56 189
57 177
58 183
59 154
60 118
61 131
62 110
63 99
64 78
65 87
66 71
67 86
68 68
69 58
70 37
71 37
72 33
73 33
74 28
75 27
76 26
77 26
78 40
79 22
80 19
81 22
82 22
83 13
84 18
85 20
86 14
87 10
88 7
89 15
90 7
91 8
92 9
93 6
94 5
95 7
96 7
97 5
98 4
99 7
100 4
101 5
102 5
103 5
104 5
105 4
106 2
107 1
108 6
109 1
110 4
111 3
112 2
113 2
114 1
115 3
116 3
118 1
119 2
120 1
121 1
122 1
124 3
126 1
129 2
132 1
134 2
136 1
137 1
138 2
139 1
141 1
142 1
145 1
147 1
149 1
158 1
159 1
163 1
164 2
172 1
183 1
199 1
216 1
235 1
251 1
284 1

At each attempt, you have a 1/10 chance of matching the comparison number. This is therefore a geometric distribution with p = 1/10; and the probability that your first success is on attempt #n is (1 - p)[sup]n-1[/sup] * p = 9[sup]n-1[/sup]/10[sup]n[/sup].

In particular, the probability of needing n attempts before a success should always be decreasing with n, but your results don’t really show that. I would agree there’s probably something wrong with your software.

Ten million is a pretty small sample.

I agree there’s a problem in your simulation software. I wrote a simulation with these results after 10 million iterations:

1 1001594
2 900705
3 808397
4 728512
5 656125
6 589556
7 531566
8 477985
9 431470
10 387422
11 348311
12 313839
13 282297
14 253487
15 227971
16 205678
17 185654
18 166685
19 149567
20 136238
21 122034
22 108782
23 98662
24 88511
25 80056
26 71854
27 64491
28 58569
29 52841
30 47038
31 42708
32 38062
33 34142
34 30826
35 27705
36 25026
37 22316
38 20392
39 18270
40 16468
41 14771
42 13213
43 12040
44 10761
45 9796
46 8636
47 7989
48 7011
49 6319
50 5732
etc. It goes to 150 but you get the idea. The count for each value decreases as expected. Here’s my code (in Perl).



use strict;
my %hist;
my $loops = 10000000;
for (my $loop = 0; $loop < $loops; ++$loop) {
    my $comp = (int rand 10) + 1;
    my $count;
    for ($count = 1;; ++$count) {
        my $pick = (int rand 10) + 1;
        last if $pick == $comp;
    }
    $hist{$count}++;
}
foreach my $count (sort { $a <=> $b } keys %hist) {
    print "$count $hist{$count}
";
}


Thanks markn+, that’s the nice even distribution I was expecting to see, although I had assumed there’d be an even distribution for the first ten attempts.

Perhaps something not quite right with my random number generator.

There are certainly a lot of poor random number generators out there. What language and RNG are you using?

PHP, using the srand() function to seed the RNG.

What happens if you remove the call(s) to srand ?

As far as I can tell, PHP (7.1.0 and later) use a Mersenne Twister RNG, so it ought to produce pretty good quality random numbers. I’d look at your own code first. What happens if you just translate my Perl code to PHP?

Yep, that worked! By not seeding the RNG, I get the smoothe distribution I was expecting.

I guess I need to brush up on how modern RNGs work. Cheers.

EDIT
Actually I should say, worked a lot better. The frequency distribution starts off smooth but then gets “lumpy”. Either way I’ll read up on how these modern RNGs work. Cheers.

Finally, using the mt_rand() function for RNG did the trick.

My guess is that you were seeding it with the time, and re-seeding it for every trial, except that the trials were happening so quickly that you got many different trials with the same seed.

If switching from rand to mt_rand changed things, you must be using an old version of PHP. In newer versions rand and mt_rand are the same thing.

I thought the question was interesting enough that I decided I’d take a stab at coding a simulation in Python 3.7, whose PRNG is also the Mersenne Twister. Here are my (poorly formatted) results. After ten million iterations, the four columns below represent: the number of trials until success, the number of iterations in which success occurred in exactly the given number of trials, the observed frequency, and the expected frequency, per the observation by MikeS given in post 2 that this is a geometric distribution.

1 1000145 0.1000145 0.1
2 899828 0.0899828 0.09
3 808692 0.0808692 0.081
4 728127 0.0728127 0.0729
5 656979 0.0656979 0.06561
6 589358 0.0589358 0.059049
7 531455 0.0531455 0.0531441
8 477694 0.0477694 0.04782969
9 429710 0.042971 0.043046721
10 386893 0.0386893 0.0387420489
11 348832 0.0348832 0.03486784401
12 314924 0.0314924 0.031381059609
13 282752 0.0282752 0.0282429536481
14 255468 0.0255468 0.02541865828329
15 229403 0.0229403 0.022876792454961
16 205531 0.0205531 0.0205891132094649
17 185673 0.0185673 0.01853020188851841
18 167552 0.0167552 0.01667718169966657
19 150294 0.0150294 0.015009463529699911
20 134367 0.0134367 0.01350851717672992
21 121805 0.0121805 0.01215766545905693
22 110062 0.0110062 0.010941898913151235
23 98758 0.0098758 0.009847709021836112
24 88612 0.0088612 0.0088629381196525
25 79483 0.0079483 0.00797664430768725
26 71899 0.0071899 0.007178979876918526
27 64792 0.0064792 0.006461081889226674
28 58308 0.0058308 0.005814973700304006
29 51886 0.0051886 0.005233476330273605
30 47131 0.0047131 0.004710128697246245
31 42502 0.0042502 0.004239115827521621
32 37890 0.003789 0.003815204244769458
33 34363 0.0034363 0.0034336838202925126
34 31059 0.0031059 0.0030903154382632613
35 27755 0.0027755 0.002781283894436935
36 24723 0.0024723 0.0025031555049932416
37 22512 0.0022512 0.0022528399544939175
38 20289 0.0020289 0.0020275559590445255
39 18179 0.0018179 0.001824800363140073
40 16622 0.0016622 0.0016423203268260658
41 14896 0.0014896 0.0014780882941434592
42 13264 0.0013264 0.0013302794647291132
43 11991 0.0011991 0.001197251518256202
44 10769 0.0010769 0.0010775263664305817
45 9641 0.0009641 0.0009697737297875237
46 8900 0.00089 0.0008727963568087712
47 7845 0.0007845 0.0007855167211278942
48 7236 0.0007236 0.0007069650490151047
49 6421 0.0006421 0.0006362685441135942
50 5770 0.000577 0.0005726416897022348

I’ll show my work, too:


import random
comp = random.randint(1, 10)
L = [0] * 150
for t in range(10000000):
    i = 0
    n = -1
    while n != comp and i < 150:
        n = random.randint(1, 10)
        if n == comp:
            L* += 1
        else:
            i += 1
for m in range(0, 150):
    op = float(L[m] / 10000000)
    ep = (9 ** m) / (10 ** (m + 1))
    print(m + 1, L[m], op, ep)

May I ask a few follow-up questions? It is my understanding that PHP, Perl, and Python are all interpreted languages. Is there a speed advantage of any of these languages over the other, all else being equal? I understand the answer may be processor dependent. My code ran on my decidedly middle-of-the-road processor in about six minutes. That feels glacial. If my code could be optimized, I would appreciate any insight. Would others be willing to share how long it takes their code to run?

Thanks for any feedback.

Post #2 explains why this is not the case.

When were you seeding the RNG? For what you describe, you should only seed it once at the beginning of the run, not for each trial to find the next match. Reseeding a RNG inappropriately can mess up the randomness.

That does seem slow. My Perl version takes about 15 seconds on a Linux VM running on a Windows host on a 3.4 GHz Xeon CPU. Your final loop does a lot more than mine does, but it’s only iterating 150 times so I doubt that’s the issue. I don’t know much about python – does range(10000000) actually create a list of 10 million numbers in memory? If so, that’s probably the issue – change it to an incrementing for loop.