I’m playing around with Random.Org and a few questions popped into my head.
I created a random list of 10,000 numbers from a list from 1-10,000. Since there were repeats, it was obvious that the random number generator didn’t eliminate numbers once they were used.
I plug the list of numbers into an Excel spreadsheet column and put them in numerical order. Then I compare that list of numbers to their row numbers and note the difference between them.
For instance, if the numbers 1-8 were chosen at random for 1,5,8,2,6,4,5,4.
In numerical order they’d be
1
2
4
4
5
5
6
8
And you’d note that #1, 2, 4, 5, and 8 were the 1st, 2nd, 4th, 5th and 8th numbers.
I did this twice. Once 202/10,000 matched. The second time 141/10,000 numbers matched.
First question is what’s the expected number of matches in a list of 10,000? How would you even go about figuring that out?
Second, I also looked at how far off the number was from the row number. I subtracted the row# from the actual# and came up with one range of -99 through 34 and another with a range of -91 through 27.
What would the expected range be over the long term of doing this experiment?
I don’t have the math handy to back me up, but I would expect both to scale as the square root of the sample size. Thus, for instance, if you picked a million numbers up to a million, you’d have a thousand-ish of them in the right place, and the ones that were out of place could be out by as much as a thousand-ish places.
Interesting. So you’d believe that the number of matches should roughly equal the total range away from 0? My ranges were around 120-130 for both. Both heavily skewed in one direction, though I imagine that was just a quirk of two samples and it would even itself out over time.
To Answer the second question, consider dividing all the number on your list by N so that all of your data lies between 0 and 1. Then looking at the largest deviation is exactly the same a performing aKolmogorov–Smirnov test. for repeated runs the number of negative or positive ranges should be symmetric, but within a run of 10,000 there may by chance be a tendency one way or another.
Overall you expect a maximal absolute difference of about 0.83Sqrt(N) with 95% or repeated runs falling between 0.48Sqrt(N) and 1.48*Sqrt(N)