How random is random enough?

Okay, I’ll take the easiest one. (I’m Swedish, so I use “,” as the decimal separator. I believe many of you use “.” instead.)

X = the total number of heads from the 100 flips

X~bin(100;0,5)

P(X=50) = 100C50 * 0,5^50 * 0,5^50 = 0,07959

The probability of getting exactly 50 heads from 100 flips is about 8%.
I’ll try to explain so that you can do this by yourself.

X is binomially distributed because there are n “attempts”, in each of which either the event A (heads in this case) or Ac (A complement, i.e. not heads) can happen. The event A has the same probability in every attempt (here 0,5), and the result of one attempt does not change the probability of any other attempt; they are independent (a heads on your first flip, for example, does not affect the probability of heads on the next one).

So,
n=100
A=heads
P(A)=0,5
independent

X~bin(n,p) <— X is binomially distributed, and we note n and p because we’ll use them in the formula later.
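As a sanity check, you can also just simulate the experiment. A minimal Python sketch (my own illustration, nothing to do with Access or the thread’s data):

import random

# Flip 100 fair coins, many times over, and count how often
# exactly 50 come up heads. Should land near 0,08.
trials = 100_000
hits = sum(
    1 for _ in range(trials)
    if sum(random.random() < 0.5 for _ in range(100)) == 50
)
print(hits / trials)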

Now we can calculate the probability of any number of heads, from 0 to 100. The general formula is P(X=x) = nCx * p^x * (1-p)^(n-x)

The nCx part is a mathematical function you can probably find on your calculator unless it’s a really cheap one. If you can’t find it, you’ll have to put in the numbers yourself: nCx = n! / (x! * (n-x)!). It looks more difficult than it is.

n! is of course the factorial of n, that is n*(n-1)*(n-2)*…*1, or with an example, 4! = 4*3*2*1. I’m pretty sure there’s an n! key on every calculator; there’s one on the Windows calculator, for example (advanced mode). So you’d type in: 100 [n!] [/] [(] 50 [n!] [*] 50 [n!] [)] [=], where calculator keys are shown in [brackets]. You end up with a rather large number, but then you continue with the original formula, multiplying with 0,5^50 twice. If you do it right you should be left with the same number I got before (0,07959).
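If you have a computer rather than a calculator handy, the same arithmetic is a one-liner. A Python sketch (my addition, just to verify the number):

from math import comb

# nCx = n! / (x! * (n-x)!) — the same quantity as the calculator steps above
n, x, p = 100, 50, 0.5
print(comb(n, x) * p**x * (1 - p)**(n - x))   # ≈ 0.0796, i.e. the 0,07959 above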

Then you ask what P(45<=X<=55) is.

If you want an exact answer, work out P(X=45), P(X=46) and so on up to P(X=55) and add the probabilities up. Or, if you’re content with a pretty good but not exact answer, approximate with the normal distribution. I’ve written too much already, so I won’t do that one now. (Read: I did it but got numbers that were obviously not correct, and am too tired to see what I did wrong.)
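For what it’s worth, both routes are easy to carry out in code. A sketch of mine (not the original poster’s calculation):

from math import comb, erf, sqrt

n, p = 100, 0.5

# Exact: add up P(X=45) through P(X=55) term by term
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(45, 56))

# Normal approximation with continuity correction:
# X is roughly N(mu, sigma) with mu = n*p and sigma = sqrt(n*p*(1-p))
mu, sigma = n * p, sqrt(n * p * (1 - p))
def phi(z):  # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))
approx = phi((55.5 - mu) / sigma) - phi((44.5 - mu) / sigma)

print(exact, approx)   # both come out near 0.73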

Hopefully the rest is relatively free of errors and not too incomprehensible.

  1. Having no specific pattern, purpose, or objective: random movements. See Synonyms at chance.
  2. Mathematics & Statistics Of or relating to a type of circumstance or event that is described by a probability distribution.
  3. Of or relating to an event in which all outcomes are equally likely, as in the testing of a blood sample for the presence of a substance.
    But that’s just off the top of my head, you understand. :wink:

As others have pointed out, random does not require the distribution to be even. A coin toss is even, but it’s quite possible to throw 2 tails and then 98 heads in a row. The outcome remains random but in no way uniform.

If Access is using a true random number table, any outcome is possible.

Fitting the data in the OP to a uniform distribution yields:

χ²/N_d.o.f. = 1.096

Given that N_d.o.f. = 9, the probability p of an underlying uniform distribution producing data at least this non-uniform is:

p = 36.2%

So, more than 1-out-of-3 of your 2,000-draw experiments would result in this much or more deviation from uniformity.
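To reproduce that p from the quoted statistic, a short sketch (mine, not the poster’s code; assumes SciPy is available):

from scipy.stats import chi2

# chi-squared = (chi-squared / N_d.o.f.) * N_d.o.f. = 1.096 * 9
stat = 1.096 * 9
print(chi2.sf(stat, df=9))   # upper-tail probability, about 0.36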

Note that one cannot give the probability that the underlying distribution is actually uniform without bringing in priors which will, in this case, certainly be highly subjective. (I’m pretty sure the K-S test (suggested in above posts) derives from uniform priors, which in my (subjective) opinion are pretty bad for this situation. I’d place a high prior probability on the hypothesis that MS Access 2000 uses a random number generator that is sufficiently random for the OP’s research purposes (where “sufficient” can be suitably defined.))

That is exactly right.

Results from 10,000 selections:


Record    NumberOfDups
1           961
2           1054
3           1036
4           1037
5           1003
6           1012
7           986
8           964
9           944
10          1003

Largest variation - record 2 - by just over 5.4% this time.

Results for 20,000 selections:


Record    NumberOfDups
1           1968
2           2017
3           2021
4           1981
5           2004
6           1977
7           2015
8           2042
9           1992
10          1983

This time it’s record 8 - by just 2.1%.

I am preparing to take a random sample of records in our database for someone who wants to conduct a survey. What is clear is that my first sample was way too small to be drawing conclusions from, and as the number of selections increases, the variations are getting smaller.

I suppose the problem of representativeness is something that we will need to have a look at, as there are several different categories of records within the database, and a truly random sample could just as easily be skewed as representative, couldn’t it? How could I ensure that the sample is representative?

Thanks for all the interesting responses folks.

Grim

Okay, these are the Chi squared statistics for the three samples:

  1. 10.11
  2. 11.892
  3. 2.501

(Note that in my first post to this thread, I miscalculated the statistic for sample 1 as 9.91. This time I didn’t manually enter the data into a calculator, I cut-and-pasted into Excel, so I’m reasonably confident there’s no error in my calculation.)
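The two tables quoted above are enough to check the second and third numbers. The same calculation as a Python sketch (my illustration, not the Excel sheet):

def chi_sq(observed, expected):
    # Pearson's chi-squared statistic: sum of (O - E)^2 / E
    return sum((o - expected) ** 2 / expected for o in observed)

# 10,000-selection table, expected 1000 per record
s2 = [961, 1054, 1036, 1037, 1003, 1012, 986, 964, 944, 1003]
# 20,000-selection table, expected 2000 per record
s3 = [1968, 2017, 2021, 1981, 2004, 1977, 2015, 2042, 1992, 1983]

print(chi_sq(s2, 1000))   # 11.892
print(chi_sq(s3, 2000))   # 2.501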

Now, those numbers should all be “close” to 10. There are both upper and lower bounds to “close”. In my first post, I said that the first sample was in effect “close enough” because it was less than the upper bound that I looked up in a table.

The third sample is interesting, because it looks too far away from 10 on the lower side. Rather than looking up the table, which I’d probably stuff up anyway, I’ll invoke the rule of thumb given in Sedgewick, which is that “close enough” to 10 means being within +/- 2*root(10) of 10, i.e. 10 +/- 6.32, giving the range 3.68 to 16.32.

The statistic for the third sample, 2.501, is less than 3.68, so it’s suspicious. However, we’re testing at 5% significance, so 1 in 20 times we could expect to get such a low value. We’ve got 3 samples, two of which are good, and one of which is “bad”.
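For the curious, the exact lower-tail probability is easy to get (again a sketch of mine, assuming SciPy):

from scipy.stats import chi2

# How often would a truly uniform source produce a statistic
# as LOW as 2.501 with 9 degrees of freedom?
print(chi2.cdf(2.501, df=9))   # roughly 0.02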

The above is simply of academic interest, because the Access database almost certainly uses the rand function from the MS RTL, which has been tested thoroughly by an entire generation of computer science students.

Conclusions:

  1. If you want to pick a record at random, the built-in randomizing function is probably adequate.

  2. If you want to pick, say, 100 records at random, you aren’t doing it the right way. You have to sample without replacement, as suggested by someone above. Otherwise, you could pick the same record more than once, which would fuck up your data sample.

Thank you - I will indeed need to implement “sample without replacement”.

Grim

Sampling without replacement is essentially a shuffle. The easiest way I can think of to do this in Access is to add a field of random numbers. Then sort by that field and pick the top 100 (or whatever) records.
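In Python terms, just to illustrate the idea (in Access itself you’d use the random-number column and a sort, as described):

import random

records = list(range(1, 1001))   # stand-in for the database's record IDs

# Attach a random key to every record, sort by it, take the top 100.
# No record can appear twice, so this samples without replacement.
sample = sorted(records, key=lambda r: random.random())[:100]

# The library shortcut does the same thing in one call:
sample = random.sample(records, 100)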