I can see from searching that this topic has been covered several times in the past, but stats was never my strong point so I’d appreciate a little guidance with a specific case I have. This is not homework, just a quick experiment I did while bored at work.
So I made a standard six-sided die from a blob of Blu-tack, and rolled it 120 times, recording the result each time. Clearly, Blu-tack is not a good material for manufacturing a fair die, but I thought it would be interesting to see how fair it was. Here are the raw results:
1 - 13 times
2 - 17 times
3 - 31 times
4 - 21 times
5 - 20 times
6 - 18 times
Intuitively, it looks biased, but can I prove it with stats? As far as I understand what I have read, the chi-squared value of those results is 9.2 (2.45 + 0.45 + 6.05 + 0.05 + 0 + 0.2). Comparing this with a chi-squared distribution table with six degrees of freedom suggests a p-value of about 0.15. So I conclude this die is unfair with 85% confidence. I understand this loosely means that there is a 15% chance my die is in fact fair.
My first question is whether my conclusion and working are correct. Second, how can I approximate how many trials might be needed to increase my confidence in the conclusion (assuming I continue to get a similar distribution of results, of course)? Thirdly, are there other tests that are more appropriate in this situation? Thanks in advance for any insight.
Just missed edit window - now I wonder if I should only be looking at 1 degree of freedom, because once the value of each roll is determined, the other 5 values become impossible for that roll? If so, that would increase my confidence that this is not a fair die to nearly 99.9%, so quite a key difference! I was unable to work this out from looking at the Wiki on degrees of freedom.
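For anyone who wants to check the arithmetic, here is a minimal sketch in Python (assuming scipy is available) that reproduces the 9.2 statistic and shows how much the answer depends on the degrees-of-freedom question:

[code]
# Sketch: chi-squared goodness-of-fit test for the 120 Blu-tack rolls above.
from scipy.stats import chi2, chisquare

observed = [13, 17, 31, 21, 20, 18]      # counts for faces 1..6
expected = [sum(observed) / 6] * 6       # 20 of each face for a fair die

stat, p = chisquare(observed, expected)  # uses df = 6 - 1 = 5
print(stat, p)                           # 9.2, ~0.101

# The df = 6 and df = 1 readings discussed in the thread, for comparison:
print(chi2.sf(9.2, 6))                   # ~0.16
print(chi2.sf(9.2, 1))                   # ~0.0024
[/code]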
Without any maths, you know that all six numbers should be equally represented, so it does look biased. However, 120 is far too few rolls to get a fair sample. I would think that anything fewer than 1000 rolls would be too small a sample.
Maybe, but apparently chi squared can be used for sample sizes of about 30 upwards, so I think it is still meaningful - a bigger sample should generate greater confidence in the conclusion though.
I learned this as “chi square” in my stat class, not “chi squared”. Wiki insists on the latter, but every other first-page reference on Google prefers the former. Picky, picky, picky.
It’s been a long time since my stats days, but I believe you should be looking at five degrees of freedom for this. Anyhow, your results give you a p-value of about 0.10, which is usually still considered not statistically significant (a p of 0.05 is usually used as the cutoff for statistical significance, at least as I remember it from stats class back in university).
Chi-squared = 9.2; df = 5, which is just below the 90% probability that the die is unfair.
Degrees of freedom is always the number of categories - 1.
So 10% probability it is a fair die.
Not always. It’s actually the number of “free” things there are. In this case you know that the 6 percentages must add up to 100%, so you can only specify 5 freely. That is a common case, but in a two-by-two contingency table, there are four data points and only one degree of freedom. A two-by-two contingency table would be used, for example, to determine if more of the test group recovered than did the placebo group. There are four cells: test-remedy recovered, test-remedy didn’t, placebo recovered, placebo didn’t. But you already know the number given the test remedy and the number given the placebo (or at least the statistician does in a double-blind test), and the overall recovery rate also gets estimated from the data, so once those are pinned down only one cell is truly “free”.
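A quick illustration with scipy’s contingency-table test (the counts here are made up purely for the example):

[code]
# Sketch: a two-by-two contingency table and the degrees of freedom it gets.
from scipy.stats import chi2_contingency

table = [[30, 20],   # test remedy: recovered, didn't recover
         [18, 32]]   # placebo:     recovered, didn't recover
stat, p, dof, expected = chi2_contingency(table)
print(dof)           # 1, i.e. (2 - 1) * (2 - 1), not "categories - 1"
[/code]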
I second the importance of the distinction septimus notes. Were your p-value calculation accurate (you actually need 5 degrees of freedom, giving a p-value closer to 10%, as noted, but never mind that), all you could say would be “A fair die would yield a chi-squared statistic at least this large about 15% of the time”. This is different (very different, extremely different, completely different) from saying “There is a 15% chance that this is a fair die”. The probability of A given B and the probability of B given A are not the same thing.
Consider, analogously, if we were to calculate not simply the probability of getting a chi-squared statistic as extreme as observed, but, in fact, chose to use all the evidence at our disposal and ignore none of it: that is, if we answered instead the question “What is the probability, with a fair die, of getting the sequence of outcomes observed?”.
Of course, the resulting p-value would be simply 6[sup]-120[/sup]. Which seems quite low indeed. And can be made arbitrarily lower by simply rolling the die more times in a fashion which depends not at all on the nature of the die or its outcomes. A perfectly fair die would be guaranteed to “fail” this test with enough rolls, just as well as any other die.
So there is something quite glib indeed in identifying the answer to the question “What is the probability a fair die would produce such things as I’ve seen?” with that to “Given what I’ve seen, what is the probability this die is fair?”.
Your case will follow to good approximation a 5-degree-of-freedom chi-squared. The distribution that your chi-squared test statistic actually follows will be slightly narrower than a true chi-squared distribution, owing to the low-ish number of counts, but only by a small amount. (The true chi-squared RMS would be 3.162; your system’s RMS is 3.149 [via simulation]). The distribution is also somewhat discretized as there are only so many possible combinations of numbers you can get.
The chi-square value of 9.2 that you’ve obtained experimentally is worse than 89.780(3)% of unbiased cases, is exactly the same as 0.4559(7)% of unbiased cases, and is better than 9.764(3)% of unbiased cases. The numbers in parentheses are the approximate uncertainties in the last digits owing to the size of my simulated ensemble.
To address the hypothesis in question, you can say that data as “bad” as yours from an unbiased die would happen 10.219(3)% of the time. A chi-squared table would yield 10.135% (so, a pretty darn good approximation). And, as others have said, this is not the same thing as the probability that your die is biased.
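For the curious, a simulation along these lines takes only a few lines of Python (numpy assumed); with a big enough ensemble it should land near the 10.2% figure:

[code]
# Sketch: Monte Carlo estimate of how often a fair die, rolled 120 times,
# yields a chi-squared statistic at least as extreme as the observed 9.2.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.multinomial(120, [1 / 6] * 6, size=1_000_000)  # face counts per trial
chi2_stats = ((counts - 20.0) ** 2 / 20.0).sum(axis=1)

print((chi2_stats >= 9.2).mean())  # should come out near 0.102
[/code]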
[What follows is just some elaboration, hopefully interesting…]
Of note, a chi-squared test statistic is only so sensitive. Depending on the type of bias you want to test for, you could form other test-statistics that could do better. Certainly a die that produced numbers in sequence (1, 2, 3, 4, 5, 6, 1, 2, …) would be a poor die, but the chi-squared test wouldn’t see a problem with that. Or, if you suspect that all numbers are weighted fairly except for one that comes up too often and another that comes up too seldom, you could use the sum of the squared deviations of the largest and smallest counts. A simulation of your scenario under that test-statistic shows that an unbiased die would produce data as bad as yours only 4.862(7)% of the time. Of course, you can’t shop around for a test-statistic post hoc…
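A sketch of that alternative test-statistic, reusing the same simulation approach (numpy assumed):

[code]
# Sketch: the "largest plus smallest count" test-statistic described above.
import numpy as np

def extremes_stat(counts, expected=20.0):
    # Sum of squared deviations of the largest and the smallest face count.
    return ((counts.max(axis=-1) - expected) ** 2
            + (counts.min(axis=-1) - expected) ** 2)

observed = np.array([13, 17, 31, 21, 20, 18])
rng = np.random.default_rng(0)
sim = rng.multinomial(120, [1 / 6] * 6, size=1_000_000)

print((extremes_stat(sim) >= extremes_stat(observed)).mean())  # near ~0.049
[/code]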
To go yet further on the topic of interpretation of statistics, this die, like any other physical instantiation of the concept of a die, is guaranteed 100% to not be fair. It’s sure to have some imperfections, however small, and for any given die, there’s some number of rolls which would be adequate to make that imperfection obvious.
You could, if you like, come up with some sort of numerical measure of unfairness (say, the maximum deviation of any of the probabilities from 1/6), and then define a die to be “fair enough” if its unfairness is less than some set threshold. In this case, you could find the probability that a die is “fair enough”.
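As a concrete (entirely hypothetical) version of such a measure:

[code]
# Sketch: one possible numerical "unfairness" measure and a threshold test.
def unfairness(probs):
    """Maximum deviation of any face probability from 1/6."""
    return max(abs(p - 1 / 6) for p in probs)

def fair_enough(probs, threshold=0.01):  # the threshold is an arbitrary choice
    return unfairness(probs) < threshold

print(fair_enough([0.17, 0.16, 0.17, 0.16, 0.17, 0.17]))  # True
print(fair_enough([0.25, 0.15, 0.15, 0.15, 0.15, 0.15]))  # False
[/code]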
df = 5
p = 0.101348 if I plugged them into Excel right.
Conclusion is incorrect. You cannot conclude that the die is unfair. Please note that even IF a significant result says it is unfair, you cannot conclude *which* face is coming up more/less often without further tests.
A priori, via power analysis. Using GPower (free), the sample size (number of rolls) you could expect to need, given the following parameters, would be:
Beta = 0.2, effect size w = small (0.1): n = 1283
Beta = 0.2, effect size w = medium (0.3): n = 143
Beta = 0.2, effect size w = large (0.5): n = 52
Beta = 0.05, effect size w = small (0.1): n = 1979
Beta = 0.05, effect size w = medium (0.3): n = 220
Beta = 0.05, effect size w = large (0.5): n = 80
Power = 1-beta; 0.2 beta or 0.8 power is generally considered “good enough”
Effect size is what you expect it to be, e.g. based on previous research. If you overestimate it, you may not find the results you should. If you underestimate it, you may do more rolls than you need.
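Those GPower numbers can be sanity-checked with a short scipy sketch; for the chi-squared goodness-of-fit test the noncentrality parameter is n times w squared, and df = 5 with alpha = 0.05 are assumptions here:

[code]
# Sketch: rolls required for the chi-squared goodness-of-fit test, by power analysis.
import math
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

def rolls_needed(w, power, alpha=0.05, df=5):
    """Smallest n whose power reaches `power` against effect size w (Cohen's w)."""
    crit = chi2.ppf(1 - alpha, df)  # rejection threshold of the test
    # Under the alternative, the statistic is noncentral chi-squared, nc = n * w**2.
    gap = lambda n: ncx2.sf(crit, df, n * w**2) - power
    return math.ceil(brentq(gap, 2, 1e6))

for w in (0.1, 0.3, 0.5):
    print(w, rolls_needed(w, 0.80), rolls_needed(w, 0.95))
# Should match the table above to within a roll or two:
# 0.1 -> 1283 / 1979,  0.3 -> 143 / 220,  0.5 -> 52 / 80
[/code]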
Chi squared is at least an appropriate test. I say the “d” should be on there, like how the 1980s standard “that’s so cliché” should be “clichéd.”
This is an (initially) hard-to-understand but very important distinction.
Let’s use a simpler example. You have a coin, and you want to know if it’s fair or not. You flip it several times, and it comes up heads every time. After how many heads do you conclude that it’s unfair?
The answer depends on the circumstances. If it’s a standard coin you got from ordinary circulation, then the odds are very good, from the outset, that it’s very close to fair. Let’s say, for instance, that only one out of 1000 coins in general circulation is rigged. In that case, you would need about 10 heads in a row before you should believe it to be a rigged coin: Before that, you know that you’re witnessing an unlikely event, but you don’t know which unlikely event you’re witnessing. And even after 10 heads in a row, the hypothesis that it’s rigged is only slightly favored over the hypothesis that it’s fair and just having a freak streak: On the one hand, you can say “It’s really unlikely that a fair coin would give a streak that long”, but on the other hand, you can say “It’s really unlikely that I would have gotten a rigged coin in my change from McDonalds”.
On the other hand, suppose that the context is that you’re at a convention for stage magicians. Rigged coins are probably pretty common in that environment. In that case, you might be justified in concluding that a coin is rigged after only five heads.
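The arithmetic behind those thresholds is just Bayes’ theorem. A minimal sketch, using the 1-in-1000 prior from the example above and a made-up 1-in-30 prior for the magician convention:

[code]
# Sketch: posterior probability that a coin is rigged (assume a rigged coin
# always lands heads) after seeing n heads in a row, for a given prior.
def p_rigged(n_heads, prior):
    rigged = prior * 1.0                  # rigged coin: heads every time
    fair = (1 - prior) * 0.5 ** n_heads   # fair coin: a 1-in-2**n freak streak
    return rigged / (rigged + fair)

print(p_rigged(10, 1 / 1000))  # ~0.51: "only slightly favored" after 10 heads
print(p_rigged(5, 1 / 30))     # ~0.52: the magician-convention case, 5 heads
[/code]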