No Statistical Significance

Cecil, in his latest column, states, “…[T]he counts differ by less than 0.2 percent, a statistically insignificant difference…”

I often see in the results of studies, etc., that the difference is of no statistical significance. WTF? When does a difference become significant? Are there set and precise rules for deciding, or is it something the analyst concludes? Perhaps some statistics major knows the answer. Perhaps someone else does. Perhaps not.

A lot depends upon the sample size and stuff. When I took my statistics class in college, I could have nailed this answer. However, others can explain standard deviations better than I.

Statistical significance tests are a way of trying to sort out normal variation from true findings. Basically, these tests tell you the probability that a finding arose by chance.

There are various ways of testing significance, but they generally have to do with:

  1. the size of the sample (the bigger the sample, the smaller the difference needs to be to be significant)
  2. the variability of the data (the more variable the data, the bigger the difference needs to be to be significant)
  3. the size of the sample in proportion to the whole population

There’s probably more, but that’s all that’s coming to mind.

(Note the correct omission of “hi Opal” which is to be used only when one’s list otherwise would have 2 entries. Who says I haven’t been paying attention?)

These tests are indeed based on mathematical equations - and something can quite objectively be said to be statistically significant or not. In general, if the test suggests there’s a 95% chance that the findings are meaningful, they are accepted as statistically significant.

That’s why when (reputable) polls are published in the paper, you get a statement along the lines of “these results are considered accurate to within +/- 5 percentage points, 19 times out of 20”. What that’s saying is that according to the statistical test, if you conducted that poll 20 times, 19 of those times the results would fall within +/- 5 percentage points of the results reported.
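If anyone wants to see that “19 times out of 20” claim in action, here’s a quick simulation sketch. The sample size of 400 and the true support level of 50% are made-up numbers, chosen because they happen to give roughly a 5-point margin at 95% confidence:

```python
import random

random.seed(42)

N_POLLS = 2000       # how many times we "re-run" the poll
SAMPLE_SIZE = 400    # roughly the size that gives a +/- 5 point margin
TRUE_SUPPORT = 0.50  # assumed true level of support in the population

within_margin = 0
for _ in range(N_POLLS):
    # One poll: each respondent says "yes" with probability TRUE_SUPPORT.
    yes = sum(random.random() < TRUE_SUPPORT for _ in range(SAMPLE_SIZE))
    if abs(yes / SAMPLE_SIZE - TRUE_SUPPORT) <= 0.05:
        within_margin += 1

coverage = within_margin / N_POLLS
print(f"{within_margin} of {N_POLLS} polls landed within 5 points ({coverage:.1%})")
```

Run it and you’ll see roughly 19 out of every 20 simulated polls land within 5 points of the truth.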

However, just because something is “statistically significant” doesn’t mean it truly is meaningful. I do a lot of work in marketing to physicians, and they regularly dismiss “statistically significant” results simply because, from a practical point of view, they’re not meaningful to everyday life.

I know there’s more to it. I know someone else will be happy to fill in the gaps.

D18

Most statistics is based on a calculated figure known as the p-value. The p-value has a technical definition that is a little hard to wrap your head around, but it’s worth the effort.

There are two hypotheses. One, known as the null hypothesis (H-sub-0), is the one that says nothing special is going on. The other, known as the alternative hypothesis (H-sub-1), is that there is something significant happening.

**The p-value is the probability that the observed result (or one even more extreme) could have happened by chance alone under the null hypothesis.**

Say it was .0001. That means that if the null hypothesis were correct, we would see something like this 1 in 10,000 times by chance alone. One in 10,000… it just seems a bit much to believe that this particular sample was the one in 10,000, so we conclude it wasn’t just chance, and decide that the alternative hypothesis is correct.

Example: You flip a coin 200 times. You get heads every time.
H-sub-0: Coin is unbiased
H-sub-1: Coin is biased towards heads.
Observed sample is 200 heads. You do some math and get a p-value of .00000000000000001 or so, telling you that if the coin were unbiased, you would virtually never see results like this. Therefore, we conclude that the coin is biased.
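For this particular example the arithmetic is a one-liner, since under H-sub-0 every flip is an independent 50/50 (the exact p-value is even smaller than the number quoted above):

```python
# Under H-sub-0 the coin is fair, so each flip is heads with probability 1/2
# and the chance of 200 straight heads is (1/2)^200.
p_value = 0.5 ** 200

print(f"p-value for 200 straight heads: {p_value:.3e}")
# Far below any conventional cutoff such as 0.05, so we reject H-sub-0
# and conclude the coin is biased toward heads.
```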

If you’re still with me, you may be wondering: at what point (p-value) do we switch from H-sub-0 to H-sub-1? A 1-in-2 chance? 1-in-5? 1-in-100? There is no technical reason to prefer one over another. However, scientists generally use the .05, or 1-in-20, barrier. If the p-value ends up less than 0.05, they will decide that the experiment showed something of significance. (Note this means that out of 20 experiments where nothing is really going on, on average one will still clear the 0.05 bar - a “false positive.”)

Still with me? Probably not, but what the heck. The next thing to know is that this p-value is roughly derived from a formula that is something like: the difference between (a) what you saw and (b) what you expected to see if the null hypothesis were true, divided by some measure of the variance of the sample. (Variance means how spread out the values are. 1, 2, 1, 2, 1 has a lower variance than 10, 300000, 10, -50000, 100000000.) So if the difference between what you saw and what you expected to see is very low, and the variance is reasonably high, you can say the difference is statistically insignificant. That’s another way of saying the difference will lead to a high p-value, and therefore we keep the original null hypothesis that nothing special is going on.
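Here’s a rough sketch of that difference-divided-by-spread formula in code, using the variance example numbers from above. (Caveat: with samples this tiny, a t-test would be the proper tool; this z-test version is just to show the shape of the formula.)

```python
import math

def z_test_p_value(sample, mu0):
    """Two-sided p-value for whether the sample mean differs from mu0.

    The test statistic is (what you saw minus what you'd expect under
    H-sub-0), divided by a measure of the sample's spread (the standard
    error of the mean).
    """
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance: how spread out the values are.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    std_err = math.sqrt(variance / n)
    z = (mean - mu0) / std_err
    # Two-sided tail probability from the normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Low-variance data: even a small difference from 0 is significant.
p_tight = z_test_p_value([1, 2, 1, 2, 1], 0)
# High-variance data: a much larger difference from 0 is not.
p_wide = z_test_p_value([10, 300000, 10, -50000, 100000000], 0)

print(f"tight data: p = {p_tight:.2e} (significant at 0.05)")
print(f"wide data:  p = {p_wide:.2f} (not significant)")
```

Same null hypothesis both times, but the spread of the data decides which difference counts as significant.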

Which finally brings us to Cecil’s statement. Since you didn’t link, I don’t know which column you mean, but we can figure out what he means. He knows what p-value is needed for us to accept the alternative hypothesis - that is, to say this phenomenon is statistically significant. He knows the formula. He knows that plugging in 0.2 as the difference, divided by the sample variance, is going to result in a p-value much larger than the standard threshold value of 0.05. Therefore, he says the difference of 0.2 is statistically insignificant.

Note: There are many kinds of statistical tests. The math differs on all, but conceptually they’re all more or less like this.

A bit more about variance…

The reason we look at it is that if there is a lot of it, then the difference would have to be pretty big for us to accept H-sub-1.

For example, say you have a machine that spits out rods that are 6’ long. Its variance is normally extremely low: in a million rods, only one will be more than .001" off. So if you take a sample of 10 rods and they are all more than 6.1 feet, you would conclude something is going on.

On the other hand, if you took a sample of 10 humans, measured their height, and got an average of 6.1 feet, and you knew the population average was 6 feet, you wouldn’t be so quick to conclude something was going on. Human size varies a lot, so you’d have to see a more dramatic difference to feel confident. (Warning: I have not done the math on this at all.)

Variance does take sample size into account.

So to answer D18’s post…

  1. Sample size matters. It is part of variance, and so it helps determine “how far away” your answer needs to be from the one you’d expect if H-sub-0 is true in order for you to reject H-sub-0.

  2. Variability matters. As I’ve been ranting about.

  3. Sample size compared to population does not matter. This is why CBS can do a poll of 500 people and announce it on the evening news with some measure of statistical validity. 500 is enough for a good statistical measure, even though the national population is 600,000 times bigger than that.

Well, you’re right for large populations, such as the population of the US, but not for small populations. According to:

http://www.surveysystem.com/sscalc.htm

You’ll find the following:

The mathematics of probability proves the size of the population is irrelevant, unless the size of the sample exceeds a few percent of the total population you are examining. This means that a sample of 500 people is equally useful in examining the opinions of a state of 15,000,000 as it would a city of 100,000. For this reason, The Survey System ignores the population size when it is “large” or unknown. Population size is only likely to be a factor when you work with a relatively small and known group of people (e.g., the members of an association).
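The “few percent of the total population” effect the quote describes is a single factor in the margin-of-error formula, called the finite population correction. A sketch, using the quote’s own population sizes (the 1.96 z-value and worst-case p=0.5 are the usual textbook assumptions):

```python
import math

def margin_of_error(n, N, p=0.5, z=1.96):
    """95% margin of error for a sample of n drawn from a population of N,
    including the finite population correction factor."""
    base = z * math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((N - n) / (N - 1))  # shrinks only when n is a big slice of N
    return base * fpc

state = margin_of_error(500, 15_000_000)  # a state of 15,000,000
city = margin_of_error(500, 100_000)      # a city of 100,000
club = margin_of_error(500, 2_000)        # a small association of 2,000

print(f"state: +/- {state:.2%}, city: +/- {city:.2%}, club: +/- {club:.2%}")
```

The state and city margins come out nearly identical; only for the small association, where 500 people is a quarter of the whole group, does the margin shrink noticeably.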

D18

PS: No endorsement of surveysystem is implied, it’s just the first thing that popped up when I set my sights on a site to cite! (Wish I knew how to make one of those yellow smilie thingys.)

I’d just add that the sampling of a population should also be random and independent. The above posts do a good job of explaining significance.

I’m always a little skeptical when I see statistics used in the general media, you really need to know how they were calculated in order to decide for yourself whether the stats are valid. Statistics are a tool that can be used inappropriately just like anything else.

I remember somebody telling me that using product A could make me be 3 times more likely to get cancer. My reply was that if I buy 5 lottery tickets instead of 1, I’m 5 times more likely to win the jackpot. There’re times when you really need your BS detector.

Thanks to everyone. I really need to ponder Muttrox’s posts more thoroughly, but I got the general picture. D18, go to the bottom of any thread page. You will see smilies underlined. Click on it for all the smilies. :smiley:

Correction noted D18. I knew as I typed it that there were exceptions, but I couldn’t remember what they were. Oh well!
:slight_smile:

Finally found the article, now I can answer the OP!

http://www.straightdope.com/columns/010323.html

First thing to note is that it’s not Cecil making the claim, just Skip, so he could be wrong. My first impression is that Skip is just throwing the phrase around without having done any actual testing. Hard to tell.

But briefly, here is how it could be done.

H-sub-0: ETAOIN SHRDLU is correct
H-sub-1: ETAOIN SHRDLU is wrong

Observed results (sample) (see column):
Hm… more N’s than I’s, eh? Seems worrisome.

If H-sub-0 is correct, then what are the odds we’d see something like this? At this point I admit it’s been a decade since I got my degree, and I can’t think of the correct test to use here. Chi-squared? There’s probably more than one that could apply. But what you would do is apply some stats to see what the variance is. Then you would work that back to see what sort of range you might expect to see if H-sub-0 is actually true.

Once again, for those in back: the p-value is the probability of seeing the observed result by chance alone if H-sub-0 is true.

For example, you might conclude that if H-sub-0 predicts 5.26 million N’s, you would expect to observe anywhere from 5.15 to 5.35 million N’s. To get this range, you plug in the famous p-value, and one would probably use the standard 0.05 number here. You would then look at the observed results, see that the number of N’s is indeed in that range, and so have no reason to reject H-sub-0 - agreeing with Skip that the difference is statistically insignificant.

Conversely, you might work through the stats and get a range of 5.2592 to 5.2602 million. (This range is also known as a 95% confidence interval.) If this was the range, then you would conclude (because the number of N’s is outside that range) that it was statistically significant, you would reject H-sub-0 and accept H-sub-1.
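To make the chi-squared idea concrete, here’s a toy goodness-of-fit test. The letter counts below are completely invented (the real ones are in the column); only the procedure is the point:

```python
# Hypothetical letter counts -- the real numbers are in Cecil's column;
# these are made up purely to show the mechanics of the test.
observed = {"N": 5_300, "I": 5_210}
total = sum(observed.values())

# H-sub-0: N and I are equally common, so we expect a 50/50 split.
expected = total / 2
chi_sq = sum((count - expected) ** 2 / expected for count in observed.values())

# Two categories -> 1 degree of freedom; the 0.05 critical value is 3.841.
CRITICAL = 3.841
verdict = "reject" if chi_sq > CRITICAL else "keep"
print(f"chi-squared = {chi_sq:.2f}, so we {verdict} H-sub-0")
```

With these made-up counts the statistic comes out well under the critical value, so we’d keep H-sub-0 and call the difference statistically insignificant.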

Clear as mud by now I should think.

muttrox’ explanation about p-values was valuable, as was the illustration of coin flipping. I’ve taken five semesters of statistics classes (ugh!), and without exception all statisticians are particularly attached to coin flipping as a pure application of probability - card playing too - I think they’re all closet gamblers. I’ll simply add the following tidbits to clarify, expand, or out-and-out complicate things.

All the examples of statistics given in this thread are of the parametric variety, as opposed to nonparametric - which operates on different assumptions from what we’re talking about. “The nominal significance level alpha is guaranteed only asymptotically, as the number of observations, n, becomes infinite” (Hollander and Wolfe, 1999). This passage is from a nonparametric statistics book, which recognizes that the traditional treatment of p-values is misapplied in some discussions where the sample size is relatively small and/or the distribution is not normal (read: N(0,1)). Okay, maybe that was a bit much; I’ll try to get back to earth with this last note.

Any discussion of statistical significance presumes that there is some real distribution in the world. The null and alternative hypotheses referred to in other posts concern the true distribution of some population (as opposed to a sample from it). When something is statistically significant, the sample distribution differs from it, either in central tendency (mean/median) or in spread (variance, standard deviation). This just means that a sample can be shown to differ from a population, or from another sample, if the two means or medians differ significantly or if the two variances do.

I take this quote from a book titled “Statistics for the Terrified” by Kranzler and Moursund (1995): “The primary purpose of inferential statistics is to help you to decide whether or not to reject the null hypothesis and to estimate the probability of a Type I or Type II error when making your decision. Inferential statistics can’t tell you for sure whether or not you’ve made either a Type I or a Type II error, but they can tell you how likely it is that you have made either type” (p. 71). This phrase “how likely” refers to that all-important p-value. Briefly, a Type I error is when you (the researcher) say, “Aha! Look, the sample I took is indeed different from the population,” when in reality it is not. A Type II error is when you the researcher say, “Nope, there is no difference between this sample and the underlying population,” when in reality there really is a difference. There’s a lot more to it than that, but I doubt that anyone has read this far down, so I’ll call it good here.
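You can actually watch the Type I error rate match the significance level by simulation: run many experiments where H-sub-0 is true by construction, and count how often the 0.05 test cries wolf anyway. (A sketch; the normal data and known sigma are simplifying assumptions.)

```python
import math
import random

random.seed(0)

def two_sided_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value, assuming the true sigma is known."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

TRIALS, ALPHA = 2000, 0.05
false_positives = 0
for _ in range(TRIALS):
    # H-sub-0 is true by construction: the data really are Normal(0, 1).
    sample = [random.gauss(0, 1) for _ in range(30)]
    if two_sided_p(sample) < ALPHA:
        false_positives += 1  # the test cried wolf: a Type I error

rate = false_positives / TRIALS
print(f"Type I error rate: {rate:.3f} (alpha was {ALPHA})")
```

The observed rate hovers right around 0.05: the significance level is exactly the Type I error rate you signed up for.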

I take great exception to your gross stereotype of statisticians as closet gamblers. I am not, nor have I ever been, in the closet about it!

JHARDING I read your post and now I need a parametric. I got a splitting headache.

**MUTTROX** I gave no link because my question was a general one, not specific to Cecil’s column.

Thanks, everyone.

But barbitu8, you saw that that was jharding’s first post, and you meant to say “welcome to the SDMB, jharding!”, right? :wink:

(See I can’t do that yet, because I’m fairly new here myself!)

Just to help with your headache, all you really need to know for the purposes of basic media literacy is this:

When you look at a poll in the paper, always look at the error margin and the sample size. If the sample size is reasonably large, you can be more confident that the poll is representative of the population at large. Next look at the error margin - if Politician A is ahead of Politician B by MORE than the error margin, then there’s a good chance that there really is a difference in support.
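That rule of thumb fits in a few lines of code. (A simplification: it uses the worst-case single-candidate margin, and strictly speaking the margin on the *difference* between two candidates is larger, so treat this as the rough media-literacy check it is.)

```python
import math

def lead_exceeds_margin(support_a, support_b, sample_size):
    """Rough check: is A's lead over B bigger than the poll's 95% margin
    of error? Uses the worst-case p=0.5 margin for a single candidate."""
    margin = 1.96 * math.sqrt(0.25 / sample_size)
    return (support_a - support_b) > margin

# 48% vs 43% in a poll of 1,000: margin is about +/- 3.1 points,
# so the 5-point lead clears it.
print(lead_exceeds_margin(0.48, 0.43, 1000))
# Same numbers from only 150 respondents: margin is about +/- 8 points,
# so the race is too close to call.
print(lead_exceeds_margin(0.48, 0.43, 150))
```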

Yes, results can be skewed by leading questions, unrandomized sampling, etc., but generally most reputable polling firms are above that kind of crap in order to maintain their credibility.

D18

There’s a lot of confusion about this, actually. The correct way to determine significance is to do so before gathering data. Any decision about what is significant and what is not that is made after the data has been gathered is, strictly speaking, invalid.

Here’s how it works: the statistician chooses some hypothesis which he believes would be useful to disprove (e.g. “the letter e appears with a frequency no greater than any other letter”). He then decides what test he’s going to perform to disprove it. Next, he calculates the probability, given that the hypothesis is true, that the test will falsely show the hypothesis to be false (that is, “disprove” the hypothesis when the hypothesis is in fact true). This is the significance level.

Other posters have mentioned the “p value”. This is calculated after the test is performed, and refers to the probability of getting the results that you got purely from chance, and not from the effect that you thought existed. Note that the p value is not the same as the significance level: the p value is calculated from the data, but the significance of the test is completely independent of the results. It is a property of the test, not the data. This is a very important point, and forgetting it can lead to severe fallacies.

The standard significance level is 5%, and it has become so standard that it is basically the default, so if no significance level is decided on before the test, it’s usually assumed to be 5%. However, this is, strictly speaking, not a very rigorous way of determining significance, especially if there is more than one parameter.