Using the quadratic mean for standard deviation

So, I want to clear up a misconception I’ve heard from many teachers (and college professors). They say that we don’t use the average deviation from the mean to measure the spread of data because the positive and negative deviations cancel each other out, so you need to square the values to get rid of the negative signs. But that reasoning doesn’t hold: you could just take the absolute values, and then all of the deviations would be non-negative anyway.

I once saw the real reason we use the quadratic mean for measuring spread, instead of the arithmetic mean, mentioned in a footnote in a stats book, but at the time I only glanced at it because it was very technical - basically, the quadratic mean satisfies some rule from statistical theory that the arithmetic mean doesn’t. I cannot find that book now, so I put it to you, SD statisticians: why do we use the quadratic mean for standard deviation?

Are you familiar with the distance formula? The standard deviation of a set of n data points is essentially the distance in n-dimensional space between your set of data, and a set of data values that had the same mean but were all equal to each other.
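In symbols, with x_1, ..., x_n as the data, x̄ as their mean (used as the reference point), and n points in total, that reads:

    \sigma
      = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}
      = \frac{1}{\sqrt{n}}\,\left\lVert (x_1,\dots,x_n) - (\bar{x},\dots,\bar{x}) \right\rVert

That is, the (population) standard deviation is the ordinary Euclidean distance between the data vector and the “every value equal to the mean” vector, scaled by 1/sqrt(n).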

Quantities like the median absolute deviation and other average absolute deviations are used as a measure of statistical dispersion, though.

As you note, we don’t need to use the standard deviation (SD) just because squaring makes all the terms positive in the sum. Other things can do that, too.

There are a number of ways to think about why SD is sensible. A sprinkling:

(1) As @Thudlow_Boink mentions, you can translate this problem into one that asks “How far away is this data set, in distance, from a central point?” With only one or two data points, the picture can be drawn in 1D or 2D, which I will try here. (Aside: is there a way to do “code” formatting but not have it try to mark up things it thinks are keywords, like “and” and “1”? I just want a fixed-width font, not actual code formatting.)

* = some central point, like the mean but doesn't have to
be the mean, that we are measuring deviations from.

Here are two data sets (A and B) with 1 data point each:


              --B-------------*---------A--------
                   deviation in data point 1

Which data set, in its entirety, is further from the star? Since there’s only one data point, it’s just the distance of that single data point from the star, so data set “B” is further than data set “A”.

Here are two data sets (A and B) with 2 data points each:

              -----------------------------------
              |                                 |
              |                                 |
              |                                 |
              |                          A      |
              |                                 |
  deviation   |                                 |
   in data    |          B                      |
   point 2    |               *                 |
              |                                 |
              |                                 |
              |                                 |
              |                                 |
              |                                 |
              |                                 |
              -----------------------------------
                   deviation in data point 1

Which data set is further from the star? Well, the one furthest from the star in this “deviation” space. But since there are two data points in a given data set, we have a 2D deviation space. And to figure out which data set (in full) is further away from the star, we need to do some Pythagorean theorem work. That is, the distance that data set A is from the star is:

sqrt [ (deviation of its data point 1)^2 + (deviation of its data point 2)^2 ]

This generalizes to any number of data points. How “far” the entire data set is from a reference point is the square root of the sum of the squares of the deviations for each data point. You don’t actually have to take the square root to see which is further, of course, but taking it gives you the same units as the data again, which is convenient (and can be cleanly thought of as a distance). Also, one usually divides by the number of points to get the average squared deviation per point (since different data sets will have different numbers of data points).

(2) Note: the above quantity is the square root of the mean of the squared deviations about some reference point, which gives it the name “root mean square”, or just RMS.
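For concreteness, here’s a minimal numpy sketch of that quantity (the variable names and data are just for illustration):

    import numpy as np

    def rms_deviation(data, reference):
        """Square root of the mean squared deviation of `data` about `reference`."""
        data = np.asarray(data, dtype=float)
        return np.sqrt(np.mean((data - reference) ** 2))

    a = [4.0, 5.0, 6.0]
    b = [1.0, 5.0, 9.0]
    # Both sets are centered on 5, but B's points sit further from it.
    print(rms_deviation(a, 5.0))   # ~0.816
    print(rms_deviation(b, 5.0))   # ~3.266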

(3) A different (but fundamentally related) way to look at the problem is to note the central limit theorem, which in very rough terms says that, for typical situations occurring naturally, the sort of randomness that leads to deviations in data often follows a normal distribution, a.k.a. a Gaussian distribution, for the deviations. So, you often get your prototypical bell curve.

Say you want to compare two data sets whose data follow a normal distribution to find out which data set is more “spread out”. A very sensible thing to compare is the width of their respective normal distributions, which is given by the Gaussian parameter sigma, which is also called its standard deviation, which is also equal to the RMS quantity above.
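For reference, sigma is the width parameter that appears directly in the normal density (standard notation, with mu the mean):

    f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)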

(4) The standard deviation (or its square, called the variance) has lots of nice properties that are a consequence of all of the above. Weighted averages behave as you would want them to if the weights are 1/(variance). Error propagation does what it’s supposed to in suitably linear systems when you use standard deviation.
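To sketch the weighted-average point: for independent measurements x_i, each with its own variance sigma_i^2, the usual inverse-variance-weighted mean and its variance are

    \bar{x} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2},
    \qquad
    \operatorname{Var}(\bar{x}) = \frac{1}{\sum_i 1/\sigma_i^2}

so the more uncertain a measurement is, the less weight it gets.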

There’s something inherently ugly about the absolute value function. It’s not differentiable at zero. It doesn’t fall into a nice category of functions like polynomials. It has to be constructed piecewise.

If you look to physics, quantities that have to be positive (when made from those that can be negative) get squares, not absolute values. Energy has to be positive, but velocity does not; hence we have KE=mv^2/2. The same principle applies elsewhere.

Sure, there are some times and places where abs() can be useful. But it always seems to be less universal than when there is a square.

One point is that if you have a sample of data and wish to estimate the population mean or standard deviation, those estimates may be sensitive to outliers, whereas the interquartile range or the median absolute deviation is more robust. Or compare using least absolute deviations instead of least squares for regression. Presumably, if you are a statistician, you know what you are doing with all of these available methods :slight_smile:
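To make the outlier sensitivity concrete, here’s a quick numpy sketch (the numbers are made up):

    import numpy as np

    clean = np.array([9.0, 10.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0])
    with_outlier = np.append(clean, 100.0)   # add one wild value

    def mad(x):
        """Median absolute deviation from the median."""
        return np.median(np.abs(x - np.median(x)))

    print(np.std(clean), np.std(with_outlier))   # ~0.71 vs ~28.3 : the SD explodes
    print(mad(clean), mad(with_outlier))         # 0.5 vs 1.0     : the MAD barely moves

    q75, q25 = np.percentile(with_outlier, [75, 25])
    print(q75 - q25)                             # 1.0 : the IQR stays modest too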

You can use absolute values of deviations, and there are some benefits to it, as mentioned, but also some drawbacks. Another drawback is that it doesn’t lead to unique solutions. You don’t get a single best fit; you get a whole family of them, with a lot of wiggle room. For the simplest example, if you have two one-dimensional measurement points and want to make a best guess at the true value, then using the absolute value of deviation, any point in between those two points is an equally good match. Using squared deviations, though, your best guess from those two points will be their average.
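To illustrate with two made-up measurements at 2 and 8:

    import numpy as np

    points = np.array([2.0, 8.0])

    def abs_loss(c):
        # total absolute deviation of the two points from a guess c
        return np.sum(np.abs(points - c))

    def sq_loss(c):
        # total squared deviation of the two points from a guess c
        return np.sum((points - c) ** 2)

    # Every guess strictly between the two points gives the same absolute loss:
    print(abs_loss(3.0), abs_loss(5.0), abs_loss(7.5))   # 6.0 6.0 6.0
    # Squared loss is strictly smallest at their average:
    print(sq_loss(4.0), sq_loss(5.0), sq_loss(6.0))      # 20.0 18.0 20.0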

A historical note. The standard deviation, as a measure of statistical variation, was invented by Sir Francis Galton. He and Henry Watson also devised and analyzed one of the first Markov processes to be studied mathematically, the Galton-Watson process. This process models the evolution of family names passed paternally, as was the practice in England. He and others were concerned that the “good” family names would die out because the lower classes were having more children. He also coined the term “eugenics”, which was a particular passion of his.

For a normal distribution with mean mu and standard deviation sigma, the arithmetic average of a sample is an unbiased estimate of mu, and the sample variance (with the n-1 divisor) is an unbiased estimate of sigma squared. Someone else mentioned the central limit theorem: when taking a random sample from almost any distribution, the distribution of the sample mean tends towards a normal distribution. Since the sample mean tends towards being normal, the above tends to apply at least approximately.
Not getting into too many details, besides being unbiased, the estimates given above are also uniformly minimum-variance unbiased estimates. Unbiased, uniformly minimum-variance estimates are about as good as you can hope for in statistics.
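If you want to see the unbiasedness numerically, here’s a rough numpy simulation (mu, sigma, and the sample size are arbitrary picks):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, trials = 10.0, 2.0, 5, 200_000

    # Draw many small samples from a normal distribution.
    samples = rng.normal(mu, sigma, size=(trials, n))
    means = samples.mean(axis=1)
    variances = samples.var(axis=1, ddof=1)   # ddof=1 gives the n-1 divisor

    print(means.mean())               # ~10.0 : sample mean is unbiased for mu
    print(variances.mean())           # ~4.0  : n-1 sample variance is unbiased for sigma^2
    print(np.sqrt(variances).mean())  # ~1.88 : sample SD is slightly biased low for sigma (=2)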

Although, the cases outside of the Central Limit Theorem are uncomfortably common in a lot of applications.

Indeed. Another common distribution is the log-normal distribution, which occurs in many natural processes. It is also called the Galton distribution, after Sir Francis Galton. Loosely, the normal distribution is the limit for any random quantity that is the sum of a large number of independent random contributions. For a quantity that is the product of a large number of such contributions (so that its logarithm is a sum), you get the log-normal distribution. There are many other commonly occurring distributions.

I think that some people are missing the point of the central limit theorem. For random samples (independent and identically distributed, or i.i.d.), the sampling distribution of the sample mean tends toward the normal distribution, even if the original variables are not normally distributed. I believe that the only cases where this fails are when the original distribution has an undefined (or infinite) mean and/or standard deviation.
So the sampling distribution of the mean of i.i.d. samples from a lognormal distribution tends towards the normal distribution. The lognormal distribution is not normal (obviously), but the mean of samples from a lognormal distribution tends towards being normally distributed.
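A rough numpy sketch of that (the sample size and parameters are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)

    def skew(x):
        """Simple sample skewness: mean of the standardized values cubed."""
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / x.std()
        return np.mean(z ** 3)

    draws = rng.lognormal(mean=0.0, sigma=1.0, size=(50_000, 200))
    sample_means = draws.mean(axis=1)

    print(skew(draws.ravel()))    # strongly right-skewed (roughly 6 for sigma = 1)
    print(skew(sample_means))     # much closer to 0: means of 200 draws look far more normal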

As an interesting side note, there is a direct relationship between the mean and using squared deviation as a distance metric: if you have a set of data and look for the value that minimizes the average squared distance from the data to that value, you find that the optimum value to choose is the mean of the data.

Now if we change the distance metric to the absolute deviation and ask what value minimizes the sum of the absolute deviations from the data to that value, you find that the optimum value to choose is the median of the data.
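A quick numerical check of both facts, with an arbitrary little data set and a brute-force grid search:

    import numpy as np

    data = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
    grid = np.linspace(0.0, 12.0, 12001)   # candidate values, spaced 0.001 apart

    mean_sq  = [np.mean((data - c) ** 2)  for c in grid]
    mean_abs = [np.mean(np.abs(data - c)) for c in grid]

    print(grid[np.argmin(mean_sq)],  np.mean(data))    # both ~3.6 : squared loss -> mean
    print(grid[np.argmin(mean_abs)], np.median(data))  # both ~2.0 : absolute loss -> median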