How do I know if a statistical effect is significant?

I don’t know any statistics, but I’m thinking about trying something that probably requires me to know at least a little statistics.

Suppose I have a chart with the following columns:

Classroom; Semester; Grade; Course; Instructor
The contents of the first and second columns are probably obvious. The third column, in case it’s not obvious, would contain the average grade awarded for whatever course met in that classroom in that semester. Columns four and five, I take it, are also obvious.

And say I find that the grades in classroom A average out to 3.2 (on the four point scale) while the average of all classrooms is 3.3.

Some questions follow, and if these are completely clueless questions, please feel free to berate me, but I’d like to know if this is something I can easily pick up on my own. If you think anyone with half a brain should be able to figure out how to do this him or herself, feel free to berate me about that as well. I plead twins, workload, and a history that includes no statistics education whatsoever. Anyway, the questions are:

  1. How do I know whether the size of the effect is large enough to be significant?
  2. How do I do the whole “correcting for” thing I always read about in reports on statistical measurement? For example, correcting for the course in the sense of trying to see if the difference in scores is due to the course or instructor rather than the classroom?
  3. Aside from the other two questions, are there any really snazzy things I should be able to do with this data?

You’re going to need a formal statistics course to really do anything useful with this data. I’ll just give this much of an overview:
What you want to know is if the difference you see is due to random chance or due to some effect that you’re trying to measure. The general idea behind statistics is that you can calculate exactly how likely it is that random chance could give you the difference you observe. You then set a cutoff. In science, it’s often 5%. If there’s a 5% or less chance that what we observe could be caused by random chance, we consider the difference to be significant. But that number is arbitrary.
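To make the “could chance alone do this?” idea concrete, here’s a minimal sketch in R. The numbers here are made up for illustration, not taken from your table: it shuffles the classroom labels many times and checks how often chance alone produces a gap as big as the one observed (your 3.2 vs. 3.3).

[code]
# Permutation-style sketch with fabricated data.
set.seed(42)
grades    <- rnorm(200, mean = 3.3, sd = 0.4)          # fake course-average grades
classroom <- sample(c("A", "B", "C", "D"), 200, replace = TRUE)

observed_gap <- mean(grades[classroom == "A"]) - mean(grades)

null_gaps <- replicate(10000, {
  shuffled <- sample(classroom)                        # break any real association
  mean(grades[shuffled == "A"]) - mean(grades)
})

# Proportion of pure-chance shuffles giving a gap at least this large:
mean(abs(null_gaps) >= abs(observed_gap))
[/code]

If that final proportion comes out at 5% or less, that’s exactly the “significant at the 5% level” verdict described above.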

  1. There are formulas for that, depending on which values (sample size, population size, etc.) are known and which are variable.
  2. “Correcting for” generally means you have a known skew factor (e.g. “value x increases by y% in circumstance z”). Otherwise, you just chop off the outlying values; there’s a formula for determining how “weird” a value is, commonly measured in “standard deviations” (see the sketch after this list).
  3. The only interesting thing you can do beyond that is maybe discover some interesting correlations.
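To make that “weirdness” screen in item 2 concrete, here’s a minimal sketch in R; the function name and the toy scores are mine, not anything standard:

[code]
# Hypothetical helper: flag values more than k standard deviations
# from the mean (the crude "weirdness" screen described in item 2).
flag_outliers <- function(x, k = 2) {
  z <- (x - mean(x)) / sd(x)    # how many SDs each value is from the mean
  abs(z) > k
}

scores <- c(3.1, 3.3, 3.2, 3.4, 1.9, 3.3)
scores[flag_outliers(scores)]   # returns 1.9, the lone outlier
[/code]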

First, you’re going to need not just the mean for every class, but also the standard deviation, which measures the typical amount by which the individual scores in that class differ from the class mean. (That is, for each individual score, take its difference from the class mean and square it; average all those squared differences; then take the square root of that average.)

Generally, you would then do a one-way analysis of variance to see if there is a significant difference in mean scores across classes. If there is, that just tells you, basically, that some class mean differs significantly from some other mean; it doesn’t tell you which. If you want to know which ones differ significantly from which other ones, there are various types of “post-hoc” tests you can do. Essentially, these are like a bunch of t-tests, comparing each class to every other class, with the significance level of the results adjusted to take account of the fact that you’re doing multiple tests on the same data (and thereby increasing the likelihood that you’ll get a significant result on one of those tests purely by chance).
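In R, that whole procedure is only a couple of lines. This is a sketch assuming a data frame d with columns grade (one average per course offering) and classroom; the names are mine:

[code]
# One-way ANOVA: does ANY classroom mean differ from the others?
fit <- aov(grade ~ classroom, data = d)
summary(fit)

# Post-hoc pairwise comparisons, adjusted for multiple testing:
TukeyHSD(fit)   # which classrooms differ from which
[/code]

TukeyHSD() is one of the standard post-hoc adjustments; Bonferroni-corrected pairwise t-tests (pairwise.t.test()) are another.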

With respect to the question of what is causing the difference, that’s another story altogether. Do you have just one mean per classroom/instructor/etc, or do you have (for example) means from five different courses taught in the same classroom?

Well, you need to acquaint yourself with the terms “variance” and “standard deviation”, with the latter being somewhat more important.

A data set’s variance is the average of the squared distance of each data point from the mean. Example (drawn from wiki):

Observations: 2, 4, 4, 4, 5, 5, 7, 9
Mean (average): 5

To find the Variance, add up the squares of the differences of all the individual observations from the mean:
(2-5)[sup]2[/sup] + (4-5)[sup]2[/sup] + (4-5)[sup]2[/sup] + (4-5)[sup]2[/sup] +
(5-5)[sup]2[/sup] + (5-5)[sup]2[/sup] + (7-5)[sup]2[/sup] + (9-5)[sup]2[/sup]
= 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32

…and take the mean of this: 32 / 8 = 4

If all the observations were the same (i.e. they were all equal to the mean), the variance would be zero. If they’re all nearly the same, the variance will be quite low. If there are a few outliers, they increase the variance, and if the data is all over the place, the variance will be quite high.

The standard deviation is the square root of the variance; in this case, 2. The standard deviation is a useful guide for judging how much of an outlier an individual observation is. Through math that I won’t get into here, it’s generally accepted that if a population is “normal”, i.e. follows the familiar bell curve, 68% of it will be within 1 standard deviation of the mean, 95% will be within 1.96 standard deviations of the mean (a common value used in analysis), and 99% will be within 2.58 standard deviations of the mean.
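If you want to check the arithmetic above in R, one caveat: R’s built-in var() and sd() divide by n − 1 rather than n (the “sample” versions), so they won’t reproduce the 4 and 2 exactly:

[code]
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)                       # 5
mean((x - mean(x))^2)         # population variance: 32 / 8 = 4
sqrt(mean((x - mean(x))^2))   # population standard deviation: 2

# R's built-ins divide by n - 1, so they differ slightly:
var(x)                        # 32 / 7, about 4.57
sd(x)                         # about 2.14
[/code]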

What you’d have to do is pick a level of “significance” (95% is a useful and common standard, 99% also) and see if there are observations that fall outside this range. In the above example, we can say the 9 (which is two standard deviations above the mean, i.e. 5 + 2 + 2) is significant at the 95% level. At the 99% level, a score of 5 + 2 x 2.58 = 10.16 would be required.

The rest of the scores are sufficiently close to the mean that we can’t say there is a significant difference for any of them, unless we choose to define significance at a very low level, like 50%, which is useless for serious analysis. The thing about a 95% level of significance is that you’re looking for the unusual 5% that falls outside that range (or the unusual 1% for a 99% level). If you lower the standards so that more observations count as significant, the concept of significance becomes useless.

Relatedly, the standard deviation is symbolized by the Greek letter sigma. “Six Sigma” strategies are meant to keep a process within a very tight specification: the goal is six standard deviations between the process mean and the nearest specification limit, which (after allowing for some drift in the process) works out to about 99.99966% of output within spec. For industrial production, this is a pretty high standard.

It would be almost impossible to get meaningful data from this without doing some rotation, moving instructors and courses from room to room. Otherwise you don’t know whether Room 104 gets lower scores because the room has bad acoustics or because it’s mostly used by a particularly lousy teacher or because the classes taught in it tend to be harder.
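If you did have that kind of rotation in the data, the “correcting for” step could be sketched in R like this (column names are assumed, as before):

[code]
# Assess classroom after course and instructor are already accounted for.
# classroom is listed last because anova() tests the terms sequentially.
fit <- lm(grade ~ course + instructor + classroom, data = d)
anova(fit)   # does classroom still explain anything once the
             # course and instructor effects are soaked up?
[/code]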

I’m thinking about sitting in on the stats class here at my university next Fall, but is the kind of thing I’m talking about what you’d typically get in a first stats class? (I can’t ask the instructor because he’s gone this semester).

Another prof here, when I described what I’m tossing around here, said “Just teach yourself R.” Sounds realistic or not?

And here’s another (possibly subtle) point: Not only do you need to decide how significant is “significant enough” for you to decide that your results are “significant”, but… Wait for it…

You need to decide on an appropriate significance level before you analyze your data (and better still, before you even collect the data).

Otherwise, it’s too easy to get data that looks very suggestive, and for the researcher to decide he’s got significant results just because they sort of confirm what he’s looking for. There’s a real possibility of bias there.

Suppose you decide that you want your data at the 95% significance level in order to be convinced that you’ve really got a significant effect. And then your data weighs in at 94.8%. Do you fudge a little and decide that this is really good enough after all?

To prevent this, there is the general rule that you have to decide your “significant enough” cut-off level before you actually look at the data you got, and then be hard-nosed and hold firm to that!!!

As I put it in an introductory stat class once (I was a student, not the teacher): “You make the rules before you play the game!”

Anyone know anything about this book? It purports to help a person learn both introductory statistics and basic statistical programming using R.

Is there no way to avoid a necessity for judging something “significant” simpliciter, instead simply dealing somehow with the fact that it has a significance of x%?

If you’re strong enough on math and motivated enough, you might be able to teach yourself enough of what you need. Maybe. A lot of statistics can be boiled down to boilerplate rules, procedures, and formulas. There are a lot of researchers out there doing statistics without too much knowledge: just collect the data and plug it into your calculator, and presto! Out comes your analysis of variance. With a few big (and often ignored) caveats! (See below.)

How much math do you have? I’m thinking, from your previous posts on math and scientific subjects, that you’re no math/science dummy. You definitely need a solid fluency in basic algebra – at least a full year of college level algebra, I think. You don’t need any calculus (but I think it helps just to be calculus-level fluent in math). You definitely don’t need trigonometry.

When I took intro stat, I already had math through Differential Equations. The prerequisite, however, was just one semester of algebra. Most of the class had just that, and they weren’t all highly algebra-fluent even at the first-semester level. For them, it was a seriously difficult challenge! The class started standing-room only. The instructor let everyone into the class (they were packed in the aisles and flooding the walkway outside the door), because he knew full well that most of them wouldn’t last through the first week.

By the end of the semester, there were only 11 (eleven) left in the class.

As for that caveat about plugging data into a formula: You need to have a good experimental design to collect good data that actually “contains” the knowledge you are trying to extract. This is exactly what Gary “Wombat” Robson is referring to, above. Then, you need a good non-biased way to collect that sample of the entire “population” of data in the world, and an appropriate choice of which statistical tests to use (of which there are many). Botch any of that, and your formulaic calculations will give you excellent GIGO results.

One approach we briefly discussed was this: Instead of choosing a “cut-off” significance level and using it to decide if your data is significant, you could simply do your analysis, and publish that you got data at “x% level of significance” – and then leave it to your readers to decide if they think it’s “significant enough”.

This means, basically, that you are refraining from drawing your own conclusion about whether you’ve got a “real effect” working there, and leaving everyone else to decide for themselves.
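In R terms, that just means reporting the p-value itself instead of a verdict. A sketch, assuming room_a and others are vectors of course-average grades (names are mine):

[code]
result <- t.test(room_a, others)   # Welch two-sample t-test
result$p.value                     # report e.g. "p = 0.08" and let
                                   # readers decide what to make of it
[/code]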

A further comment on the high significance levels (commonly, 95%) commonly chosen as the “cut-off level”:

Unless your “sample” includes the entire population in question (in which case, it is called a census), you can never be absolutely perfectly totally certain that your sample is truly representative of the total population. There is always the possibility, theoretically, however improbable, that you might draw a “bum sample” just by the luck of the draw. So it’s always possible that you could get a very significant-looking result purely by random chance and bad luck.

You could decide that you need a 75% significance level. (And you still need to decide what test you will use, and have some a priori estimate of that “standard deviation” mentioned above, to decide what actual data scores would constitute a 75% level.) And suppose your data actually does meet your 75% criterion! This still leaves a 25% possibility that you could have gotten that result just by the bad luck of the draw. So you might consider the data “suggestive” of a real effect. But a 25% chance of having gotten the data by dumb luck is still considered way too ambiguous.

If you have a seriously “real effect” working, you would hope that it has some seriously real effect on the data – like, it should be strong enough to support a 95% significance level. The choice of 95% is designed to leave you with only a 5% chance that you could get a certain level of data “by chance”. By intention, that level is chosen so that nobody (climate-change deniers excepted) could call it “ambiguous”. If you get that 95% level and somebody still insists that you could have gotten that just by chance (which is theoretically true, of course), that’s your climate-change denier or creationist speaking.

Stay tuned for next lesson: How to decide if a “weak” (as opposed to “strong”) effect is, in fact, a bona fide effect . . .

R is an extremely powerful piece of software. If you’re going to be doing statistical analyses regularly, then you should learn R.

If you want to do a problem like this, once or a few times, then learning R is like shooting a mosquito with a bazooka.

A bit of clarification, just for added emphasis: In talking about getting some result “by chance”, this specifically means drawing a sample (a subset) from the entire population that, purely by the random bad luck of the draw, does not accurately represent the entire population.

Now, what if you are looking for an effect (like the effect of classroom on average class grade), and what if, in fact, there is an effect, but it’s a weak one? That is, it has a small influence, one that might easily be lost amid other effects (like acoustics, teacher skillz, etc., as suggested above). How could you ever detect this with a high degree of confidence?

The answer, in general, is that you need a larger sample. If the effect is strong, then a fairly small sample should be sufficient for you to detect it. If the classroom is really an important effect, then you should see a large variation in data dependent solely on the classroom. It might be strong enough to overpower a bunch of other effects, so the results will stand out.

But suppose the effect is real, but weak. Then it would “get lost in the noise” if you only had a small sample. But if you take an ever-larger sample of data, you could detect it.
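This sample-size question is exactly what a power calculation answers. Here’s a sketch using base R’s power.t.test(); the delta and sd values are guesses for illustration only:

[code]
# How many course-averages per group to detect a weak classroom effect?
power.t.test(delta = 0.1,       # true difference you hope to detect
             sd    = 0.4,       # guessed standard deviation of grades
             sig.level = 0.05,
             power = 0.8)       # 80% chance of detecting it if real
# ...asks for roughly 250 observations per group. A strong effect
# (delta = 0.4 with the same sd) needs only about 17 per group.
[/code]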

An important thing to understand about the meaning of “significance” (and I’m pretty sure, Frylock, that you already know this): “significant” does not necessarily mean “important”. It simply means “the data gives good evidence that the independent variable has an effect on the (allegedly) dependent variable”, even if it’s only a weak effect.

Consider a simple experiment: tossing a coin many times, to decide if the coin is fair or biased. If the coin is seriously biased, you might toss it 100 times and get 75 heads. You would conclude that, yes, it’s a bum coin.

But suppose your coin is only slightly biased. You could toss it 100 times and get 55 heads. Does that indicate a bad coin? Well? You could very well have gotten that result “by chance”. How would you know?

If you toss it 100 times again, a fair coin could just as well give you 45 heads and 55 tails. So, repeat this 100-toss experiment 100 times over (for a total of 10000 tosses). If it comes up with a few more heads just as often as it comes up with a few more tails, then it’s a fair coin. But if it comes up with a few more heads on every (or nearly every) 100-toss run, then you can conclude that the coin is slightly biased. Even though a single 100-toss run could never have detected that.
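You can check these coin numbers directly in R with binom.test(), which gives the chance of a result at least this lopsided from a fair coin:

[code]
binom.test(55, 100)      # p ~ 0.37: 55/100 heads proves nothing
binom.test(75, 100)      # p < 1e-6: 75/100 heads is a bum coin
binom.test(5500, 10000)  # p < 1e-22: the same 55% heads rate, with
                         # many more tosses, clearly shows a slight bias
[/code]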

Thus the rule: the farther your data is from what a fair, no-effect world would predict (that is, the more standard deviations away), the less sample data you need to discover that fact. The closer your data is to the no-effect prediction, the more sample data you need to discover that it is, in fact, not quite what chance alone would produce.

Maybe. R is an amazingly useful thing, but ultimately it’s a programming language written by statisticians who definitely aren’t computer scientists. It’s very idiosyncratic, and it has a steep learning curve. There is tons of useful documentation, but most of it assumes the user already knows a lot about stats and programming.

I’d say that teaching yourself R is a good idea if you already have some decent math chops and some basic programming skills.

I haven’t used that book in particular. I did use “Introductory Statistics with R” and “R in a Nutshell” when I was teaching myself. The former was pretty good and probably what you are looking for. It’s very light on the mathematical underpinnings, but it does introduce statistical concepts and basic sort of “cookbook stats”. (“R in a Nutshell”, fwiw, is a really good desk reference for when you have a task in mind but don’t know how to go about it in R. “The R Book” is a disorganized mess of a reference that doesn’t really have much that’s not in the freely accessible documentation.)

I do have idle notions of maybe doing a lot more of this kind of thing over time.

If his exact words were close to “Just teach yourself R,” then that’s not very good advice. You can learn to run R or SPSS and spit out some numbers. But are those numbers useful? Say you know enough to be dangerous and hunt for the word “significance.” Aww, you got p = .432. But were you looking at Levene’s test, where p > .05 is good because it suggests your variances are equal? Only with the knowledge of how to interpret the tests can you use them properly. I think sitting in on a class (professor agreement and time permitting) is a good idea.

As said, you’ll need either the mean and SD/variance, or else all the scores. The mean alone doesn’t tell you the shape of the data.

Important points raised re: statistical significance vs. importance or practical significance. The first question is: is this significant? The second question, which you probably should have addressed before running the experiment: will anyone care?
Another factor: effect size, i.e. the magnitude of the effect, which is not the same thing as significance. If the mean happiness is 60/100, and you significantly raise it to 60.6, is that difference actually worth anything?
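One common effect-size measure is Cohen’s d, the difference in means expressed in standard deviations. For the happiness example (the SD of 15 here is my made-up number):

[code]
# Effect size vs. significance: with a huge sample, a 60 -> 60.6 shift
# can be highly "significant" while the effect itself stays trivial.
cohens_d <- (60.6 - 60) / 15   # assumed SD of 15
cohens_d                       # 0.04, far below even the conventional
                               # "small effect" threshold of 0.2
[/code]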

Nitpick: that 5%-or-less figure holds if the null is true, i.e. the chance of falsely concluding that something happened when in reality no effect exists is < alpha. It would be just dumb chance.

Replace “statisticians” with “mathematicians” and I’m looking at you, goddamn MATLAB.

Heck, Microsoft Excel has pretty much all the basic stat functions built right in.

Unless I’m missing something here, you’re not drawing a sample; your data reflect the entire population. Since statistical significance refers to the likelihood that differences found in a sample reflect those that exist in the population, it has no bearing on your situation. Presenting results based on an entire population along with significance information just brings home the point that (general) you don’t know too much about statistics.

You might want to say something about whether the effects you found are practically significant, but statistical significance is not really a good way to help you make that assessment, I’m afraid.