Population vs. Sample variance?

Please explain why you divide by n for the population variance, but you divide by n-1 for the sample variance.

The explanation I hear is that you do this because of the difference between the sample mean and the population mean, and that subtracting 1 somehow corrects for this.

If you divide by n, the expected value of the sample variance is not the population variance. It’s (n - 1)/n times the population variance, so on average it comes out too small. Dividing by n - 1 instead cancels that factor. That’s why you do it that way.
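To check the size of that factor on a small case, here is a minimal sketch. It assumes a made-up population (the five numbers used later in this thread), an arbitrary sample size of 3, and sampling with replacement, and it averages both versions of the variance over every possible sample. The names are only illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population (an assumption for illustration only).
population = [10, 5, 7, 12, 11]
n = 3  # sample size, chosen arbitrarily


def mean(values):
    """Exact mean as a Fraction."""
    return Fraction(sum(values), len(values))


def sum_sq_dev(values):
    """Sum of squared deviations from the values' own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


# Population variance: divide by the population size.
pop_var = sum_sq_dev(population) / len(population)

# Every ordered sample of size n, drawn with replacement (all equally likely).
samples = list(product(population, repeat=n))

# Average of each estimator over all possible samples = its expected value.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(pop_var, avg_div_n, avg_div_n_minus_1)
```

Under these assumptions the divide-by-n average is exactly (n - 1)/n times the population variance, while the divide-by-(n - 1) average equals the population variance.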

When you use sample variance, you usually want it as an “unbiased” estimator of population variance.

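Dividing simply by n gives you the variance of the sample itself, but that is not an unbiased estimate of the population variance, because the formula uses the sample mean in place of the unknown population mean. The sample mean is the value that makes the sum of squared deviations as small as possible for that sample, so the deviations measured from it are, on average, smaller than the deviations from the true population mean.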
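For anyone who wants the algebra behind this, here is a short sketch of the standard argument. The notation is my own assumption, not something from the posts above: x_i are n independent observations, \bar{x} is their sample mean, and \mu and \sigma^2 are the population mean and variance.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2 \\
E\!\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]
  &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n - 1)\,\sigma^2
\end{align*}
```

So the sum of squared deviations from the sample mean is worth only n - 1 "units" of \sigma^2 on average, and dividing it by n - 1 rather than n gives an estimate whose expected value is \sigma^2.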

The n-1 also shows up in Student’s t distribution, which has n-1 degrees of freedom and reflects the distribution of the sample mean more accurately than a normal distribution when the variance has to be estimated from the sample. As n gets larger and larger, t becomes more like a normal distribution, and the minus 1 has less and less effect on the computation as well. So the n-1 has been generally agreed upon.

The premise goes something like:

If we know the average value of a sample of 5 observations, and we know what four of these observations are, we can calculate the fifth, i.e.:

Samples: 10,5,7,12,11
Average: 9

If you were given the average and four of the observations, the fifth would follow automatically (average 9 times 5 observations -> a total of 45, minus (10+5+7+12) = 11), i.e. the last observation is not completely independent of the others. We have lost a “degree of freedom” and must adjust the n accordingly. When the sample size gets very large, this n-1 correction becomes trivial, and the sample variance gets close to the population variance.
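The same arithmetic as a small sketch, using the five observations above. The variable names are just for illustration. It also shows that the deviations from the average always sum to zero, which is another way of saying only four of them are free to vary.

```python
import statistics

observations = [10, 5, 7, 12, 11]
average = statistics.mean(observations)

# Recover the fifth observation from the average and the first four.
known_four = observations[:4]
fifth = average * len(observations) - sum(known_four)

# Deviations from the sample average are forced to sum to zero,
# so once four of them are known the fifth is fixed.
deviations = [x - average for x in observations]
sum_sq = sum(d * d for d in deviations)

divide_by_n = sum_sq / len(observations)
divide_by_n_minus_1 = sum_sq / (len(observations) - 1)

print(fifth, sum(deviations), divide_by_n, divide_by_n_minus_1)
```

The last two values should agree with statistics.pvariance(observations) and statistics.variance(observations) respectively, which are the standard library's divide-by-n and divide-by-(n-1) versions.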

Exactly, Brian, the explanation in terms of degrees of freedom is the one I was thinking of as well.

Knowing the average and using it in your calculations makes you lose one degree of freedom, i.e. one of the values/observations is not free to vary anymore. So you have to adjust your n accordingly.

Sometimes, when even more information estimated from the sample is used in a statistical calculation, the degrees of freedom decrease even more and you see n-2 or n-3 in the formula.