Statistics question

My father in law asked me to look at some lottery numbers for him the other day because he is convinced that there is some cheating going on. I am just as sure that there is no cheating going on, for the simple fact that the operator doesn’t need to cheat to make a killing. Curiosity got the better of me, and while I’m no math wizard, I can do some pretty nifty things with data visualization that would make cheating stand out pretty sharply.
As expected, the numbers are being generated as randomly as one could hope for. Any amount of cheating that I couldn’t see wouldn’t be worth worrying about. I did, however, notice something unexpected (at least unexpected to me.)
There are 49 numbers (1 through 49 inclusive,) and six numbers are drawn at random. If you sort those six from lowest to highest (which is how the data I got were arranged,) then there is a clumping of values in certain positions. That is to say, there is a much higher chance of a 49 being the highest number than there is of a 20 being the highest number. There is a better chance of 25 being the third or fourth number than there is of a 10. There is a better chance of a 1 being the lowest number.
To some extent, I expected this. It is obvious that if you draw six numbers you will never have a number lower than 6 as the highest, and likewise never a number higher than 44 as the lowest.
What really got me, though, was that there is a range for each position outside of which (at least in the 2349 drawings for which I have data) there are no occurrences.
As an example, there was not a single drawing in which any number over 35 was the lowest. Thinking this over, I see that the odds are against it happening, I just didn’t expect it to be that way.
What I am trying to ask (after all the background) is how I would go about calculating the chance that a certain number will be the (highest, lowest, Xth) in a certain drawing.
I’m not trying to beat the game, I am just curious about the mathematics behind the “islands” I see for each of the positions. I know that the islands are an artifact of statistics and a limited data set, but I would like to be able to calculate where they should be and how large (and how high the peaks are) for a given hypothetical number of data points.
Can anyone point me in the right direction?

I don’t know if this is the best way to do it, but here’s one way.

Okay, there are 49C6 = 13983816 possible ways to pick 6 numbers out of 49. There are 14C6 = 3003 possible ways to pick 6 numbers out of the first 14, and 15C6 = 5005 possible ways to pick 6 numbers out of the first 15. So, there will be 5005 - 3003 = 2002 combinations in which 15 will be the highest number. 2002 / 13983816 = 0.000143. So, for a sample of 2349, you’d expect 2349 × 0.000143 = about 1/3 to have 15 as the highest number. So getting 0 like that would be not at all unlikely. By symmetry, the chances of 35 being the lowest are the same as the chances of 15 being the highest.
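For anyone who wants to check the arithmetic above, here is a minimal sketch in Python. It assumes Python 3.8 or later for math.comb, and the function name p_highest is just something made up for this illustration.

```python
from math import comb

def p_highest(n, pool=49, picks=6):
    """Probability that n is the largest of `picks` numbers drawn from 1..pool."""
    # Sets with every number <= n, minus sets with every number <= n-1.
    return (comb(n, picks) - comb(n - 1, picks)) / comb(pool, picks)

drawings = 2349
p15 = p_highest(15)
print(f"P(highest = 15) = {p15:.6f}")                      # about 0.000143
print(f"expected count in {drawings} drawings = {drawings * p15:.4f}")  # about 0.3363
```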

The general formula is a little unwieldy:

The probability that n is the highest number is given by: P = 6 × (n-1)! × (49-6)! / ((n-6)! × 49!)
The probability that m is the lowest number is the probability that (50-m) is the highest number.

For the Xth number, um, I think it’s a little trickier, but not terribly so.
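One way the Xth position could be handled is with the same kind of counting: for n to be the kth smallest of the six, pick k-1 numbers from the n-1 below it and 6-k numbers from the 49-n above it. The sketch below assumes that standard counting argument; the name p_position and its parameters are invented for the example.

```python
from math import comb

def p_position(n, k, pool=49, picks=6):
    """Probability that n is the k-th smallest of `picks` numbers from 1..pool."""
    below = comb(n - 1, k - 1)           # ways to pick the k-1 smaller numbers
    above = comb(pool - n, picks - k)    # ways to pick the picks-k larger numbers
    return below * above / comb(pool, picks)

# k = 6 reproduces the "highest number" case, k = 1 the "lowest number" case.
print(p_position(15, 6))   # chance that 15 is the highest
print(p_position(35, 1))   # chance that 35 is the lowest, same value by symmetry
# For any one position, the probabilities over all n should add up to 1.
print(sum(p_position(n, 3) for n in range(1, 50)))
```

With k = 6 this reduces to the formula given above for the highest number.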

Dumb ass question here:
What does 49C6 mean?
I follow you otherwise, and the math isn’t hard to understand. Thanks.

I don’t know the general formula, but here’s the basic idea.

Say you have 50 numbers. The chances of drawing a number 21-50 would be 30/50 on the first draw.

The second draw the chances would be 29/49 since one number would now be unavailable.

The chances of the first two numbers being greater than 20 would then be 30/50 times 29/49 and so forth. The chances of at least one number being less than 21 would be 1 minus that product.

Five consecutive numbers being greater than 20 would be:

30/50 x 29/49 x 28/48 x 27/47 x 26/46 = .067
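The draw-by-draw product can be sketched like this, using exact fractions so no rounding creeps in. The function name p_all_above is hypothetical, and the 50-number pool is just the example from this post.

```python
from fractions import Fraction

def p_all_above(cutoff, draws, pool):
    """Probability that `draws` numbers drawn without replacement
    from 1..pool are all greater than `cutoff`."""
    p = Fraction(1)
    for i in range(draws):
        # On draw i there are (pool - cutoff - i) numbers above the cutoff
        # left, out of (pool - i) numbers remaining in total.
        p *= Fraction(pool - cutoff - i, pool - i)
    return p

p = p_all_above(cutoff=20, draws=5, pool=50)
print(float(p))        # about 0.067
print(float(1 - p))    # chance that at least one number is 20 or lower
```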

By the way, one can never prove cheating by the results. There is always a chance the same set of numbers will appear day after day after day. But the chances become so extremely small, one would believe there was a fix, beyond all reasonable doubt.

nCk = n!/(k!*(n-k)!)

Ultrafilter:
Ahha. Got it.

Thanks ultrafilter. It’s called the Binomial Coefficient, and it’s often described exactly how I used it, that is:

nCk = the number of ways of picking k elements out of a set of n elements.

Incidentally, nCk is usually read as “n choose k”. If we were writing it down on paper instead of having to use a computer, we would put the n over the k and put parentheses around both of them. (But that’s all covered in the link that Achernar gave, I just noticed.)
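As a quick illustration of the nCk formula, here is a small sketch that computes it straight from the factorials and compares it against Python's built-in math.comb (Python 3.8 or later). The helper name choose is made up for the example.

```python
from math import comb, factorial

def choose(n, k):
    """n! / (k! * (n-k)!), the number of ways to pick k items out of n."""
    return factorial(n) // (factorial(k) * factorial(n - k))

print(choose(49, 6))   # 13983816
print(choose(15, 6))   # 5005
print(choose(49, 6) == comb(49, 6))
```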

Actually, my father in law’s premise is that the lottery operator is analyzing the bets, and then influencing the drawings so as to minimize the payout. Since the millions of bets that are placed in the whole country should produce a pretty random set, you (or at least I) would expect to see a pattern in the numbers not chosen in the drawing. To minimize payouts, you would have to avoid the maximum probabilities dictated by the statistics.
In a rough overview, I see pretty much random noise. With another visualization that I use, I would expect a straight line across a graph of maximum probability for the six positions (which I see.) A deviation from the straight line would indicate cheating. There are fluctuations, but no severe deviations.
In this same display, cheating would also show as a widening and flattening of the ridge of maximum probability for all six positions. I’ll have to do some fiddling to see how wide it should be.

Mort Furd writes:

> . . . the millions of bets that are placed in the whole country
> should produce a pretty random set . . .

Well, no, actually. Do people get to choose their own numbers for their bet? Then I would expect that the bets are not randomly spread among all the numbers. People tend to choose numbers in highly non-random ways.

Wouldn’t it be easier to figure out roughly how often a payout should be made, and compare that to the rate at which they are made?

To determine the width of the ridge, I am trying to calculate the number at which there will be exactly one occurrence as the highest in a set of six for 2349 drawings. I keep coming up with 15 as above 1 and 16 below one, which is okay because I don’t see how it could be an integer. It doesn’t jibe with Achernar’s calculation of about 1/3 of an occurrence of 15 as the highest (35 as the lowest) in 2349 drawings, but I’ve run through the general formula as he gave it and also solved for the whole mess another way and came out with the same answer, so phooey. I don’t know what is wrong.
Anyway, assuming I’ve done things right, there hasn’t been any twiddling with the probabilities, either. The answer of fifteen or sixteen fits pretty well with the graph.
Is someone still around who could tell me at what number I should expect one occurrence as the highest (or lowest, doesn’t matter) in a set of six in 2349 drawings?

Nope. Unfortunately, they also pay off on partial matches and I have no data on them.
All told, to cheat as my father in law proposes, the operator would have to make a compromise between paying out one big sum or a whole bunch of little ones and see which way is cheaper. He would also have to weigh in the fact that frequent small payoffs (on partial matches, for example) would tend to keep people playing whereas less frequent big payoffs (and few little ones) might tend to drive off players - and also consider that a long period with no large payoff drives up the pot and entices more people to try for the big one. One great big hairy mess, and one of the reasons that I don’t expect that they cheat - the others being that it would be too hard to cover up since 1955 and that it would be damned hard to do physically.

Well, I tried plugging in the numbers again, but I’m still getting the following expected values (for 2349 trials):

15 - 0.3363
16 - 0.5044
17 - 0.7337
18 - 1.0395

Now, it’s possible I did something very wrong, but you said that for 15 it’s above 1.000 and for 16 it’s below 1.000? This doesn’t sound right to me; the numbers should be increasing as you go up.
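The expected values for 2349 trials can be recomputed with a short sketch like the one below, which also looks for the first number whose expected count reaches 1. It uses the "highest number" count (n-1)C5 / 49C6 from earlier in the thread; the names expected_highest and first are just for this illustration.

```python
from math import comb

DRAWINGS = 2349
TOTAL = comb(49, 6)

def expected_highest(n):
    """Expected number of drawings (out of DRAWINGS) in which n is the highest."""
    return DRAWINGS * comb(n - 1, 5) / TOTAL

for n in range(15, 19):
    print(n, round(expected_highest(n), 4))

# Smallest n whose expected count as the highest number reaches 1.
first = next(n for n in range(6, 50) if expected_highest(n) >= 1)
print("expected count first reaches 1 at", first)
```

The values increase with n, so the crossover point is unique.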

True, people choose their numbers in non-random ways, but there are millions using non-related systems to choose their numbers. If the systems are non-related, then the sum should appear random.
That is one way of generating random numbers, in fact. You take two systems that operate independently and asynchronously and combine their outputs as your random number. I considered how to do this once long ago in generating random numbers for a piece of portable equipment. The operator only activates it sporadically, so his actions are not related to anything that the equipment is doing. You just let a counter run during standby, and use the current value when the operator activates the system. If the counter runs fast (megahertz rates), then you can get some pretty large random numbers rather easily.
Also, even if the systems are related, they will still quite likely produce random output. One of the favorite things to do is to use a birthday to make up the numbers. You would get a bias to numbers in the 1-31 range, but then lots of people will figure this and intentionally pick numbers outside of that range. On the whole, I do think that the bets are placed pretty randomly.

OK, got it. You are working on probabilities of a number being the highest in a set of six. I was working on the probability of it being the lowest in a set.

You can use my fractional method in my earlier post - determine the probability of exclusion of the first “x-1” numbers, then determine the probability of exclusion of the first “x” numbers, and subtract the second from the first.

To determine the expected a priori frequencies for x being the lowest number in 2349 trials, multiply the above result by 2349.

Example: x=5
44/49 x 43/48 x 42/47 x 41/46 x 40/45 x 39/44

x=5-1, then
45/49 x 44/48 x etc.
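The fractional method for the lowest number can be sketched as follows. This assumes the 49-number, six-pick game from the original question; the function names p_exclude_first and p_lowest are invented for the example.

```python
from fractions import Fraction

def p_exclude_first(x, pool=49, picks=6):
    """Probability that none of the numbers 1..x is drawn."""
    p = Fraction(1)
    for i in range(picks):
        # e.g. x = 5 gives 44/49 x 43/48 x 42/47 x 41/46 x 40/45 x 39/44
        p *= Fraction(pool - x - i, pool - i)
    return p

def p_lowest(x, pool=49, picks=6):
    """Probability that x is the lowest number drawn."""
    # Lowest is x when 1..x-1 are all missed, but 1..x are not all missed.
    return p_exclude_first(x - 1, pool, picks) - p_exclude_first(x, pool, picks)

p5 = p_lowest(5)
print(float(p5))            # chance that 5 is the lowest number
print(float(2349 * p5))     # expected count in 2349 drawings
```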

It’s hard to see where this gets you, relative to your original interest in whether the numbers are somehow not random. There will undoubtedly be a difference between the a priori probability and the actual frequency for a given number, but these differences, summed over all possible numbers, must equal zero.

Sorry, I got kind of side tracked. My father in law was interested in the probabilities (in reference to cheating.) I was interested in the statistics and math behind the distributions I saw in the graphs I made.
The graphs I’ve made are a special type of distribution chart I made up while looking for ways to eliminate noise in audio signals. Using the graphs, I can see non-random distribution quite easily even if I can’t mathematically describe it. The graphs are an X-Y plot with an intensity for each intersection point represented by color. For totally random data, without the restrictions implied in a lottery drawing, the graph shows a hill in the middle of my graph. The restrictions implied by a lottery drawing (non-repetition in particular) stretch the hill into a ridge and cause peaks at the ends of it - sort of a saddle-shaped thing with the lowest point of the crest of the hill being at the center of the graph. Random data have a certain distribution across the ridge, and influence both the width and the height of the ridge in proportion to the total number of data points - with more data points, the ridge will be both higher and wider, but a non-random data set will distort the ridge shape. A systematic avoidance of the most likely numbers (as represented by the crest of the ridge) would result in the ridge being wider and flatter for a given number of data points than a truly random set with the same number of data points. The width of the ridge for a given number of occurrences (which I picked as one for simplicity) of a certain number in a certain position can be calculated by the formulas above. Then I set the colors so as to highlight that particular level and see if the width matches - which it pretty well does.
If you break the ridge down into the six sorted groups of a lottery drawing, you form islands along what would normally be the ridge. The mathematics of these islands was what prompted me to start this thread, not the possibility of the lottery operator cheating - I know damned good and well that they aren’t.
Thanks for all the help, folks.
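For what it's worth, the islands for all six positions could be estimated with a sketch like the one below. It assumes the standard counting formula for the kth smallest of six numbers (k-1 chosen below n, 6-k chosen above n), and it treats "expected at least once in 2349 drawings" as the edge of an island, which is only one possible convention; real data will scatter around these edges. All names here are invented for the example.

```python
from math import comb

DRAWINGS = 2349
POOL, PICKS = 49, 6
TOTAL = comb(POOL, PICKS)

def expected_count(n, k):
    """Expected number of drawings in which n sits in sorted position k (1 = lowest)."""
    return DRAWINGS * comb(n - 1, k - 1) * comb(POOL - n, PICKS - k) / TOTAL

islands = {}
for k in range(1, PICKS + 1):
    counts = {n: expected_count(n, k) for n in range(1, POOL + 1)}
    island = [n for n, c in counts.items() if c >= 1]   # expected at least once
    peak = max(counts, key=counts.get)                  # most likely value
    islands[k] = (island[0], island[-1], peak)
    print(f"position {k}: island {island[0]}..{island[-1]}, "
          f"peak at {peak} (about {counts[peak]:.0f} drawings)")
```

Changing DRAWINGS shows how the islands widen and the peaks grow as more data points are added.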

Mort Furd writes:

> True, people choose their numbers in non-random ways, but
> there are millions using non-related systems to choose their
> numbers. If the systems are non-related, then the sum should
> appear random.

That’s not necessarily true. In fact, I believe that I’ve read that bettors tend to use only a few systems, and these tend to result in highly non-random bets overall. The real way to use this is to put your bets on the numbers that are least likely to be chosen. That way if you do win you’ll be more likely to be the only winner. It’s probably still not a good enough system to make it worthwhile, but it does slightly improve the odds.

Which would make a truly random drawing the best strategy for the lottery operator.