I am using the Randomise function to select records at random from a list (to conduct a random sample survey for research purposes). To test how random the results were, I selected 2000 times from a list of 10 records and got the following results:
Obviously I don’t expect the results to be exactly evenly distributed, but 179, for example, seems rather low (about 10% off the average of 200). Is this good enough, or is there a better way of going about it?
I can’t be bothered to actually work out how likely being 10% off is. But I’m not at all surprised - that feels like a perfectly normal variation.
Is it good enough? Well, what do you want to use it for? A ‘random sample’ - is that something like picking 10 records out of your database and then going and finding out something more about those? You could possibly improve it by selecting 10 ‘representative’ records, but that depends.
Access, AFAIK, generates pseudorandom numbers using random number tables. These are more than close enough to true random for what you are doing.
You shouldn’t expect a random sequence to have an exactly even distribution; the results should be, well, random. Sometimes you will get unusually high or low counts.
You could try running a mean square successive difference test to see if the results you got truly vary at random, but I can tell you now that if you did it properly in Access, they will. The only possible reason they wouldn’t is an error in your code.
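For the curious, here is roughly what that test boils down to, sketched in Python rather than Access VBA; the 2000 draws below are simulated stand-ins for the real data, since the OP’s sequence isn’t posted:

```python
import random

# Simulated stand-in for the OP's Access output: 2000 draws of a record
# number between 1 and 10 (the real data would come from the database).
draws = [random.randint(1, 10) for _ in range(2000)]

n = len(draws)
mean = sum(draws) / n
variance = sum((x - mean) ** 2 for x in draws) / (n - 1)

# Mean square successive difference: average squared gap between
# consecutive draws in the order they were generated.
mssd = sum((draws[i + 1] - draws[i]) ** 2 for i in range(n - 1)) / (n - 1)

# For an independent (random) sequence this ratio should be close to 2;
# values well below 2 suggest a trend, well above 2 suggest oscillation.
print("MSSD / variance =", mssd / variance)
```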
Actually, I wouldn’t mind a more detailed explanation of how to determine expected, or reasonable, number distribution in a general problem of this type.
Other problems, for example:
Flip a coin 100 times; what are the chances of getting a 50/50 split? Or anywhere between a 45/55 and 55/45 split?
Take, say, 5 years’ worth of Lotto data (let’s say it’s 260 weeks, where each week six numbers are picked out of a possible 47). How would you determine what a reasonable distribution of numbers looked like?
This is a question my wife and I were actually discussing this weekend. The local paper had an article on a local hospital that was being sued by patients because of its higher-than-average failure rate in kidney transplants (as well as some other issues that are not germane here). So, assuming an overall national average failure rate of, say, 7%, if four local hospitals each do 100-200 transplants, how would one expect their individual failure rates to look if the differences were due solely to chance?
Seems like these three questions could be answered by the same technique as the question in grimpixie’s OP.
I can’t yet find a good online explanation, but this is described as a binomial distribution. The normal way of calculating it with 100 things is to approximate it with a normal distribution with the same average and “spread” (variance). WAG: 50/50 is quite unlikely but 45-55 heads is pretty likely.
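To put rough numbers on that WAG, here’s a minimal Python sketch using the exact binomial probabilities. It also covers the hospital question, with a made-up volume of 150 transplants standing in for the real figures:

```python
from math import comb, sqrt

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Coin flips: 100 tosses of a fair coin.
p_exactly_50 = binom_pmf(100, 50, 0.5)
p_45_to_55 = sum(binom_pmf(100, k, 0.5) for k in range(45, 56))
print(f"P(exactly 50 heads) = {p_exactly_50:.3f}")   # about 0.08
print(f"P(45 to 55 heads)   = {p_45_to_55:.3f}")     # about 0.73

# Hospital example: n transplants at the 7% national failure rate.
# Two standard deviations either side of the mean covers roughly 95%
# of the outcomes expected from chance alone.
n, p = 150, 0.07                      # n is a made-up illustrative volume
mean, sd = n * p, sqrt(n * p * (1 - p))
print(f"Expected failures: {mean:.1f} +/- {2*sd:.1f} (about "
      f"{(mean - 2*sd)/n:.1%} to {(mean + 2*sd)/n:.1%} failure rate)")
```

So an exact 50/50 split shows up only about 8% of the time, while something between 45/55 and 55/45 happens roughly three-quarters of the time; and a hospital doing 150 transplants could land anywhere from about 3% to about 11% failures purely by chance.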
I take it you took ten numbers (1-10) and selected 2000 from that, with 1 showing up 179 times, 2 showing up 210 times, etc.
Just looking at the results, the only thing I wouldn’t have expected is that #10 came out at exactly 200. That seems odd to me, but maybe not.
How time-consuming is it to run the thing 5,000 or 10,000 times? If the random function is actually random, then you should get an increasingly tight result. I just don’t believe you’ve run the thing enough times to make a judgment entirely satisfactory to your purpose.
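As a rough illustration (simulated in Python rather than Access), here is how the worst-case deviation from the average tightens as the number of draws grows:

```python
import random
from collections import Counter

# Draw a record number 1-10, N times, and report how far the most
# extreme count is from the expected N/10, as a percentage.
for n_draws in (2000, 10000, 100000):
    counts = Counter(random.randint(1, 10) for _ in range(n_draws))
    expected = n_draws / 10
    worst = max(abs(counts[r] - expected) for r in range(1, 11))
    print(f"N={n_draws:>6}: worst count is {100 * worst / expected:.1f}% off the average")
```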
The standard test for a situation like this is (as cited previously) the chi-squared test, sometimes known as a “goodness-of-fit” test.
To introduce this idea, in Stats 101 our professor assigned us homework to roll a die 300 times and record the results. Of course no one did it; they just made up some numbers. He wrote the numbers (of the first 10 students) on the board, ran the chi-squared test, and proved they were all cheaters. (Or at least statistically very likely to be cheaters.)
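For the record-selection case, the test looks something like this Python sketch. Note that the observed counts are mostly placeholders: only the 179, the 210, and the 200 for #10 come from the thread, and the rest are invented so the total comes to 2000. The 16.92 is the 5% critical value of chi-squared with 9 degrees of freedom.

```python
# Chi-squared goodness-of-fit sketch for 2000 draws over 10 records.
# NOTE: most of these observed counts are placeholders, not the OP's data.
observed = [179, 210, 196, 205, 199, 211, 190, 203, 207, 200]
expected = sum(observed) / len(observed)   # 200 if the total is 2000

chi_sq = sum((o - expected) ** 2 / expected for o in observed)

# 16.92 is the 95th percentile of the chi-squared distribution with
# 10 - 1 = 9 degrees of freedom.
print(f"chi-squared = {chi_sq:.2f}")
print("consistent with uniform" if chi_sq < 16.92 else "suspiciously non-uniform")
```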
The Chi-square statistic is certainly one test, but by no means the only test of randomness. The expected result of simple random sampling with replacement is a uniform distribution; however, the reverse is not necessarily true. Just because you have a uniform distribution does not mean that it was generated using the aforementioned sampling scheme. Put another way, rejection of the null hypothesis is a good indicator of non-randomness, but failure to reject by no means makes you certain of randomness. You could just as well have generated 179 1’s in a row, 210 2’s in a row, etc., and come up with exactly the same distribution, which passes the chi-square test.
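A quick way to see that point, sketched in Python: a fully sorted sequence has exactly the same counts (and therefore the same chi-square value) as a shuffled one, but the ordering gives it away.

```python
import random

# Build a sequence with a perfectly plausible-looking distribution...
counts = {record: 200 for record in range(1, 11)}   # 2000 draws, 200 each
sorted_seq = [r for r, c in counts.items() for _ in range(c)]  # 1,1,...,2,2,...
shuffled_seq = sorted_seq[:]
random.shuffle(shuffled_seq)

def adjacent_repeats(seq):
    """How often a draw is identical to the one immediately before it."""
    return sum(a == b for a, b in zip(seq, seq[1:]))

# Both sequences have identical counts, so chi-squared can't tell them
# apart, but the ordering is wildly different.
print("repeats, sorted:  ", adjacent_repeats(sorted_seq))    # 1990
print("repeats, shuffled:", adjacent_repeats(shuffled_seq))  # around 200 expected
```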
Depending on how much effort you want to put into this, you can Google for “randomness test”. The NIST also has something to say about Random Number Generation and Testing.
BTW, the easiest sampling scheme for the kind of survey you are proposing is “simple random sampling without replacement”.
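In Python terms that is essentially a one-liner (a sketch with made-up record IDs; in Access you would get the same effect by rejecting duplicates as you draw):

```python
import random

# Hypothetical list of record IDs standing in for the real table.
record_ids = list(range(1, 10001))   # e.g. 10,000 records

# Simple random sampling WITHOUT replacement: each record can be
# chosen at most once, which is what you want for a survey.
sample = random.sample(record_ids, k=100)   # pick 100 distinct records
print(sample[:10])
```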
Don’t even think of selecting ‘representative’ records. If you, as a human, choose “representative” records, then you’re going to find exactly the results you expect to get. You don’t actually know which ones are representative until after you’ve done your random sampling, by which time you don’t need them any more.
OK, I’m sorry. It’s probably completely inapplicable here (c’mon grimpixie, tell us what you ARE trying to do) and Chronos is right that it’s a bad idea to try… But in more complicated situations it’s perfectly normal, such as in market research, when you sort people by age, politics, etc.
Normal doesn’t mean it’s right. You can cluster and stratify and factor – plenty of well-worked out statistical techniques to do those marketing studies. Choosing representative instances is just plain wrong.
Possibly somebody that knows more can explain it. I don’t know what the statistical techniques are, but I suspect what you mention is what I mean by ‘representative’.
E.g. you want to find the average of something. It is already known that men’s and women’s somethings are distributed as N(u, ss) and N(v, ss) respectively. If you choose two people at random and average their values, there’s a 50/50 chance your result will be way out. If you choose one of each, the average is relatively tightly distributed about (u+v)/2. Is my analysis wrong? Or did I use the wrong name for it? Or is it just unlikely to be useful in this case?
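For illustration, here is a quick simulation sketch of that comparison (with made-up means of 160 and 175 and a common SD of 10, purely to have numbers to plug in):

```python
import random
from statistics import pstdev

# Made-up illustration: women's values ~ N(160, 10), men's ~ N(175, 10).
U, V, SD = 160.0, 175.0, 10.0

def simple_random_pair():
    """Pick two people at random from a 50/50 population and average them."""
    return sum(random.gauss(random.choice((U, V)), SD) for _ in range(2)) / 2

def stratified_pair():
    """Pick one person from each group and average them."""
    return (random.gauss(U, SD) + random.gauss(V, SD)) / 2

trials = 100_000
srs = [simple_random_pair() for _ in range(trials)]
strat = [stratified_pair() for _ in range(trials)]

# Both estimates are centred on (U+V)/2, but the stratified one is less
# spread out because it can never draw two people from the same group.
print("spread, simple random:", round(pstdev(srs), 2))
print("spread, stratified:   ", round(pstdev(strat), 2))
```

Both estimates centre on (u+v)/2, but the stratified one has a noticeably smaller spread, which is the effect described above.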
If you want real primo entropy, random.org can hook you up. The main things to be worried about are that the period of your built-in computer-based random numbers (PRNG) might be too short (sometimes as short as 2^16). That is, after 65536 samples the numbers repeat. The other big thing to keep in mind is that the way you seed your PRNG may also be significant, regardless of how good the PRNG is. Normally the seed is calculated from the system clock, which has at most 32 bits to work with (generally), so while you might think you are generating 150 independent coin flips, what you are actually doing is looking up one of 2^32 sequences of 150 coin flips that have been certified to look really random.

All that being said, people managed to do quite a bit of science while using random number tables published in books, and those tables didn’t have 2^32 numbers to choose from. The only way PRNGs are inferior to an actual table is that deep down there is probably a simple relationship between consecutive numbers (as each number is calculated from the previous number/state). It’s generally unlikely that this relationship will cause problems, but if you are really paranoid you can go to the above site.
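To see the seeding point concretely (using Python’s generator, a Mersenne Twister with a far longer period than 2^16, but the principle is the same):

```python
import random

# The same seed always reproduces the same "random" sequence...
random.seed(12345)
first_run = [random.randint(0, 1) for _ in range(10)]

random.seed(12345)
second_run = [random.randint(0, 1) for _ in range(10)]

print(first_run == second_run)   # True: the output is fully determined by the seed

# ...so a seed with only 32 bits of entropy (e.g. the system clock) can only
# ever select one of 2**32 possible sequences, however long those sequences are.
```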