Standard Deviation

After many years I’ve had to start using statistics
again in my job, and I’m once again reminded that I’ve
never understood why the computations for variance and
standard deviation use a divisor of N-1 rather than N.
After all, the variance is the average of the squared
differences between each measurement and the mean, and
averages are computed by dividing by N, not N-1. When
asked, all my instructors waved their hands, mumbled
something about “degrees of freedom”, and went on to
other topics.
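For concreteness, here are the two formulas the question is about, written with x̄ for the sample mean (nothing here beyond the standard textbook definitions):

```latex
% Divide by N: the plain "average of the squared differences"
s_N^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2

% Divide by N-1: the usual sample variance
s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2,
\qquad \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
```

The standard deviation is the square root of whichever version you use.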

As a result, I’ve never heard a fully comprehensible,
cogent explanation, so I turn to this board for possible illumination.

::waving hands::

[sub]degrees of freedom[/sub]

N-1 is only used in inferential statistics, where you are using a sample to estimate the variance of a population. It’s a correction to ensure that, on average, you aren’t underestimating it.

[Horshack]Oooooh! Ooooooooh! Ooooooh![/Horshack]

I FINALLY get to contribute something useful. I know this one. (Yes, the Masters in Statistics becomes useful.)

Now, to translate it out of Geek (er…Greek).

The divisor is N-1 because one divides by the degrees of freedom, not the number of observations. Why? Let’s look at the definition.

The number of degrees of freedom of a statistic is the number of observations minus the number of restrictions placed on them (usually 1). For example, suppose I have five numbers that sum to another number.

A+B+C+D+E=F

There are infinitely many possibilities for the six numbers, though only combinations where the first five actually sum to F are allowed.

Suppose I add the restriction that F = 15.

A+B+C+D+E=15

There are still gobs of possibilities! However, if I know 4 of the 5 values, the fifth value is now set. For example:

1+3+C+6+2=15

It is “given” now that C = 3. If I know 4 of the 5, I actually “know” the fifth, because of the restriction that the sum = 15.

Only 4 of the 5 values can vary (are free to vary) because this “locks” the value of the fifth one.

Because there are five variables and one restriction, only 5-1 = 4 of them are free to vary, i.e., 5-1 = 4 degrees of freedom.

With standard deviations, the sum of the deviations from the sample mean must equal 0. This is the restriction. The degrees of freedom for the SD calculation is therefore the number of observations (deviations from the mean) minus the number of restrictions (the sum must equal 0), which gives N-1 degrees of freedom.
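To spell that restriction out (this is just the algebra behind spritle's statement): the deviations from the sample mean always sum to zero, because of how the mean is defined:

```latex
\sum_{i=1}^{N} (x_i - \bar{x})
  = \sum_{i=1}^{N} x_i \;-\; N\bar{x}
  = N\bar{x} - N\bar{x}
  = 0
```

So once N-1 of the deviations are known, the last one is forced, exactly like C in the sum-to-15 example.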

It’s that simple.

Be careful, some wacky statistics have more restrictions so the df for them is N-2 or even N-3!!

Let’s look at small cases to see if this makes sense. Suppose your sample is a single observation. If you calculate the SD with n, you get zero. With n-1, you get 0/0, which is indeterminate. Which is a better representation of what you know about the standard deviation from one observation:
a) It’s always zero
b) I have no idea what the SD is, because it is undefined for a single observation

Now suppose you have two observations, a and b. You take their mean, m = (a+b)/2. Then you take (a-m)^2 = ((a-b)/2)^2 and (b-m)^2 = ((b-a)/2)^2.
Hey look, you got the same number! Why divide by two, when you really only have one number?
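A quick way to check both of these small cases is NumPy's ddof argument (ddof=0 divides by n, ddof=1 divides by n-1); the numbers below are made up purely for illustration:

```python
import numpy as np

one = np.array([42.0])          # a single observation
two = np.array([86.0, 90.0])    # two observations; their mean is 88

# One observation: dividing by n gives 0, dividing by n-1 gives 0/0 (nan).
print(np.var(one, ddof=0))      # 0.0
print(np.var(one, ddof=1))      # nan, with a "Degrees of freedom <= 0" warning

# Two observations: the two squared deviations are identical, (a-m)^2 = (b-m)^2 = 4.
print((two - two.mean()) ** 2)  # [4. 4.]
print(np.var(two, ddof=0))      # 4.0 -> sum of 8 divided by n   = 2
print(np.var(two, ddof=1))      # 8.0 -> sum of 8 divided by n-1 = 1
```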

OK. The Ryan’s case of one observation is good. I really can’t make any estimates of dispersion from one observation. But then I didn’t really expect to.

With two observations, yeah, their squared distances from the mean are equal. But their average squared distance is still [(a-m)^2 + (b-m)^2]/2, not [(a-m)^2 + (b-m)^2]/1. Just as if I have two test scores of 88, the average is (88 + 88)/2 = 88, not (88 + 88)/1 = 176.

Suppose now I’ve got three numbers: a, b, c. Their mean is m = (a+b+c)/3. More often than not I’ll have three distinct squared distances from the mean. I really have an incredibly strong desire to divide by 3, not 2.

spritle gave a good explanation of what degrees of freedom are. But when the intuitive notion of variance is “average squared distance from the mean”, dividing by one less than the number of distances is still far from obvious.

I’m not going to do the math, but I’ll point you in a (hopefully) promising direction. You’re using the mean derived from the measured data, but for data with a variance, that calculated mean will tend to differ from the “true” mean by an amount dependent on the variance and on the number of data points. When using the wrong mean in your equation for the variance (with a divisor of N), the variance will be under- or over-estimated. I suspect, if you do the math, you’ll find an expected under-estimate factor of (N-1)/N.
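A sketch of that math, for anyone who wants it, assuming independent observations with true mean μ and true variance σ²:

```latex
\sum_{i=1}^{N}(x_i-\bar{x})^2
  = \sum_{i=1}^{N}(x_i-\mu)^2 - N(\bar{x}-\mu)^2
\quad\Longrightarrow\quad
\mathbb{E}\!\left[\sum_{i=1}^{N}(x_i-\bar{x})^2\right]
  = N\sigma^2 - N\cdot\frac{\sigma^2}{N}
  = (N-1)\,\sigma^2
```

So dividing the sum by N gives an estimate whose expected value is ((N-1)/N)σ², exactly the under-estimate factor above, while dividing by N-1 gives an expected value of σ².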

For example, in The Ryan’s case with only two data points, there is a 50 percent chance the two measurements err in the same direction. It should be obvious how that could really skew the result.

If you somehow know the “true” mean, you do use N as the divisor, not N-1.

What I’ve heard, but never seen adequately explained, is this:

When you are doing your calculation based on the entire population of data points, and wish to measure (not estimate) the scatter in the form of a standard distribution, you normalize by N. Examples of this would be, I think, exam scores for a class; you have all of the data in your hand, and aren’t estimating anything.

OTOH, if the data you have to work with is a sample of the complete set of data, then what you’re calculating is an estimate of the amount by which the entire dataset varies about its mean. Like ZenBeam said, the way you apparently correct for this is to normalize by N-1.

Why? Honestly, I have no idea. All of the sources I’ve checked simply state that this is the way it is, and leave it at that. I’m frustrated.

Am I at least on the right track here? :slight_smile:
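Short of the algebra, a quick simulation can show the same thing; this sketch (sample size and population parameters chosen arbitrarily) draws many small samples from a population whose variance is known and averages the two versions of the estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0               # population variance (sigma = 2), chosen arbitrarily
n = 5                        # small sample size, where the N vs. N-1 gap is largest
trials = 200_000

samples = rng.normal(loc=10.0, scale=2.0, size=(trials, n))

var_div_n   = samples.var(axis=1, ddof=0)   # divide by n
var_div_nm1 = samples.var(axis=1, ddof=1)   # divide by n-1

print("true variance:          ", true_var)
print("average, divide by n:   ", var_div_n.mean())    # comes out near (n-1)/n * 4 = 3.2
print("average, divide by n-1: ", var_div_nm1.mean())  # comes out near 4.0
```

The divide-by-N version comes out low by roughly the factor (N-1)/N; the divide-by-N-1 version averages out to the true variance. That is the sense in which the sample estimate is being “corrected”.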

Distribution, deviation, whatever. Standard deviation is what I think I mean there.

jebert, if I understand your example correctly… It ends up being an extreme example, but that’s sort of useful.

This all hinges on the fact that SD is a statistic, not a parameter. Sure, the average squared difference is 88 (using N, or 2, as the denominator). But remember, SD is an estimate. To “correct” for the fact that you don’t know what the real population looks like, you fudge upwards. Now, if you have a sample of 2 (which is a sucky sample), N-1 becomes 1 and wow, suddenly you’re taking that 88 you’d expect and it’s DOUBLED. You know the sample’s average squared difference is 88, but the statistics rules make you say it’s 176. That seems like an abominably inflated measurement, but hey, you’ve got a lousy sample about which you can’t have a lot of confidence. So to be sure roughly 68% of the cases in the population fall within one SD? It’s going to be a BIG SD. In this case, actually double the mean difference.

Remember, though, most of the time you’re talking about bigger numbers. When N is 50, for example, the difference is dividing by 49 instead of 50. Not such a big difference. It’s a smaller fudge. The bigger the sample, the less consequential the difference between using N vs. N-1 in the denominator. And that’s as it should be. Have a small sample, and the effect of knocking 1 off the denominator is much bigger.
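To put numbers on how fast that fudge shrinks, the ratio between the two versions of the estimate is just N/(N-1) (illustrative arithmetic only):

```python
# How much bigger the N-1 estimate is than the N estimate, for various sample sizes
for n in (2, 5, 10, 50, 1000):
    print(n, n / (n - 1))
# 2     2.0      -> the estimate doubles
# 5     1.25
# 10    1.111...
# 50    1.0204   -> barely matters
# 1000  1.001
```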

Shoot, should I have been using small n’s throughout this discussion?

Zen and brad’s comments jogged my memory. When you divide by N, the sample estimate of the variance is biased (it comes out too small, on average). Dividing by N-1 removes the bias.