Several probabilities questions

I’m playing around with Random.Org and a few questions popped into my head.
I created a random list of 10,000 numbers from a list from 1-10,000. Since there were repeats, it was obvious that the random number generator didn’t eliminate numbers once they were used.

I plug the list of numbers into an Excel spreadsheet column and put them in numerical order. Then I compare that list of numbers to their row numbers and note the difference between them.

For instance, if the numbers 1-8 were chosen at random for 1,5,8,2,6,4,5,4.
In numerical order they’d be
1
2
4
4
5
5
6
8

And you’d note that #1, 2, 4, 5, and 8 were the 1st, 2nd, 4th, 5th and 8th numbers.

I did this twice. Once 202/10,000 matched. The second time 141/10,000 numbers matched.

First question is what’s the expected number of matches in a list of 10,000? How would you even go about figuring that out?

Second, I also looked at how far off the number was from the row number. I subtracted the row# from the actual# and came up with one range of -99 through 34 and another with a range of -91 through 27.
What would the expected range be over the long term of doing this experiment?

I can’t answer your question off the top of my head, but what you’re looking for for the first question is “order statistics.”

I don’t have the math handy to back me up, but I would expect both to scale as the square root of the sample size. Thus, for instance, if you picked a million numbers up to a million, you’d have a thousand-ish of them in the right place, and the ones that were out of place could be out by as much as a thousand-ish places.

I don’t know the answer to the questions but

it wouldn’t be a random number generator if it did.

The general solution “expected number of matches” when drawing n number from 1…n with replacement is:

Sum for 0 <= k <= n-1 (n!/(k! n^(n-k)))

When n = 10,000 Python tells me the expectation is 124.99912186808118.

Interesting. So you’d believe that the number of matches should roughly equal the total range away from 0? My ranges were around 120-130 for both. Both heavily skewed in one direction, though I imagine that was just a quirk of two samples and it would even itself out over time.

Sorry, what’s k?

k is the index of summation.

To Answer the second question, consider dividing all the number on your list by N so that all of your data lies between 0 and 1. Then looking at the largest deviation is exactly the same a performing aKolmogorov–Smirnov test. for repeated runs the number of negative or positive ranges should be symmetric, but within a run of 10,000 there may by chance be a tendency one way or another.

Overall you expect a maximal absolute difference of about 0.83Sqrt(N) with 95% or repeated runs falling between 0.48Sqrt(N) and 1.48*Sqrt(N)