Why half-life and 1 standard deviation = 68%?

Plenty of examples, including average electrical power, amplifier power efficiency, and others from physics and engineering.

The simplest motivation for RMS is that you need to do something to the deviations to make them all positive (or you’re just going to get 0 for the average deviation), and a square is the simplest smooth function that does that.
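A two-line check of that, with made-up numbers: the raw deviations cancel to exactly zero, while the squared deviations give a usable measure of spread.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
dev = x - x.mean()

print(dev.mean())                # 0.0 -- positive and negative deviations cancel
print(np.sqrt((dev**2).mean()))  # 2.0 -- RMS of the deviations (the SD)
```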

Quoth ultrafilter:

Not true. I could take two Lorentzian distributions with completely different interquartile ranges, and truncate them way the heck out on the tails so they both have finite but extremely large standard deviations. If I choose my truncation points correctly, I can give them exactly the same standard deviation as each other, without any significant change to their (very different from each other) interquartile ranges. There’s literally no limit on how different the interquartile ranges can be, for two distributions with the same standard deviation.

And the fact that standard deviation is significantly affected by the entire data set can sometimes be a bug, not a feature. Sometimes you want to ignore, or at least deweight, the outliers.
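(If anyone wants to see ultrafilter’s construction in action, here’s a numerical sketch. The closed forms for the variance and quartiles of a symmetrically truncated Cauchy/Lorentzian are standard; the scale parameters and the outer truncation point are arbitrary choices for illustration.)

```python
import numpy as np
from scipy.optimize import brentq

def trunc_var(gamma, T):
    # Variance of a Cauchy (Lorentzian) with scale gamma, truncated to [-T, T]:
    # gamma^2 * (u/arctan(u) - 1), where u = T/gamma.
    u = T / gamma
    return gamma**2 * (u / np.arctan(u) - 1.0)

def trunc_iqr(gamma, T):
    # Interquartile range of the same truncated distribution.
    u = T / gamma
    return 2.0 * gamma * np.tan(np.arctan(u) / 2.0)

g1, T1 = 1.0, 1000.0          # narrow Lorentzian, truncated way out in the tails
target = trunc_var(g1, T1)

g2 = 100.0                    # much wider Lorentzian...
T2 = brentq(lambda T: trunc_var(g2, T) - target, 1e-6, 1e6)  # ...truncated to match

print(np.sqrt(trunc_var(g1, T1)), np.sqrt(trunc_var(g2, T2)))  # identical SDs (~25)
print(trunc_iqr(g1, T1), trunc_iqr(g2, T2))                    # IQRs ~2 vs ~42
```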

…of course absolute value is conceptually clearer and Mean Absolute Deviation is more intuitive as well. But the calculus tends to be a lot messier than for squares. (And if you’re running a regression, AAD tends to produce ambiguous intercepts, IIRC. But that can be finessed, IME.)

High-speed computers (like, oh, the 386) can make some of these issues go away. Regression based on absolute deviations (rather than squares) is sometimes known as robust regression, as it resists undue effects from outliers. I opine that AAD and MAD deserve more attention.
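A quick illustration of that outlier resistance, with made-up data (one wild point appended to an otherwise tame sample):

```python
import numpy as np

clean = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 9.9, 10.1])
dirty = np.append(clean, 100.0)   # one wild outlier

def mad(x):
    # Median absolute deviation from the median.
    return np.median(np.abs(x - np.median(x)))

print(np.std(clean), np.std(dirty))  # SD explodes: ~0.13 -> ~29.8
print(mad(clean), mad(dirty))        # MAD barely moves: 0.10 -> 0.15
```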

I think the simplest motivation for RMS is the Pythagorean Theorem. You can think of random variables as vectors in a high-dimensional Euclidean space, with the RMS of a variable being its length.

Furthermore, one dimension of that space will correspond to constant variables, while the perpendicular dimensions correspond to variables of mean zero. Thus, every variable will uniquely decompose into the sum of two components, a constant variable component and a mean-zero variable component. The lengths of these two components are the variable’s mean and standard deviation, respectively.

(E.g., variables measured over a population of size N, being lists of N numbers, are quite readily thought of as N-dimensional vectors with the indicator variables for each member of the population comprising an orthogonal, equal-length basis. Taking unit length to be that of the constant unit variable, every variable’s length becomes identified with its RMS, by the familiar Pythagorean Theorem. [Granted, one might occasionally want to consider non-Euclidean norms as well, but the advantages of the Euclidean norm and attendant theory of angles are familiar, at least from a geometric perspective])
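A numeric check of that decomposition, with made-up data and lengths taken in the RMS sense (per the unit-length convention above): the constant and mean-zero components are orthogonal, and mean² + SD² = RMS², which is just Pythagoras.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_part = np.full(x.size, x.mean())  # component along the constant direction
dev_part  = x - mean_part              # mean-zero remainder

print(mean_part @ dev_part)            # 0.0: the two components are orthogonal

rms = np.sqrt(np.mean(x**2))
sd  = np.sqrt(np.mean(dev_part**2))
print(rms**2, x.mean()**2 + sd**2)     # 29.0 both ways: Pythagoras
```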

And this is a large part of the reason why people like least squares: we understand what’s going on geometrically.

I would say that’s the most fundamental motivation for it, but most people don’t regard N-dimensional vector spaces as “simple”. Of course, similar motivations also lead to the Gaussian distribution being so widespread: The standard deviation is, in some real sense, the most natural width measure for the Gaussian (but not so natural for other distributions).

As I mentioned up in post 19, other distributions have other measures as being most natural, by the way.

Oh, and Gaussian distributions and standard deviations are also related to the practice of using least-squares fitting to get a best-fit line (or other curve). Least-squares fitting is the optimum fitting routine for data with Gaussian errors.
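Concretely (a sketch with simulated data): sweeping a grid of candidate slopes, the slope that minimizes the sum of squared residuals is exactly the slope that maximizes the Gaussian likelihood of those residuals.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + rng.normal(0, 1, size=x.size)   # true slope 3, Gaussian noise

slopes = np.linspace(2.5, 3.5, 1001)
sse    = [np.sum((y - m * x)**2) for m in slopes]                   # least squares
loglik = [norm.logpdf(y - m * x, scale=1.0).sum() for m in slopes]  # Gaussian MLE

print(slopes[np.argmin(sse)], slopes[np.argmax(loglik)])  # same slope both ways
```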

This sounds interesting. What exactly is meant by it, though? Could you say some more about the sense in which these various norms are the natural measures of dispersion for these various distributions?

(Specifically, I was curious as to whether the sense of the correspondence was that there is some principled, uniform way of turning norms into distributions which turns the L1, L2, and L-infinity norms into the Laplacian, Gaussian, and uniform distributions, respectively. On reflection, though, I suppose the sense is actually the converse, that turning distributions into norms in such and such a way (apparently, via maximum likelihood whatnot) sends those distributions to those norms, but with other distributions being sent to those same norms as well)

The standard deviation (L2) is proportional to the free parameter describing the dispersion in the equation for the Gaussian distribution, the mean absolute deviation (L1) is proportional to the free parameter describing the spread in a Laplacian distribution, and the maximum absolute deviation (L-infinity) is equal to the half-width of the uniform distribution (the distance from the mean to the maximum or minimum value).

Also, if you minimize L2 in a curve fitting problem, you get an unbiased estimate if the noise in your data is Gaussian distributed; if you minimize L1, you get an unbiased estimate if the noise is Laplacian; and if you minimize L-infinity, you get an unbiased estimate if the noise is uniform. If you use the “wrong” norm for the noise, you’ll probably get bad estimates.
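For the simplest possible “curve” (fitting a constant), each of those minimizations has a closed form, so the matching is easy to simulate. A sketch with made-up noise scales:

```python
import numpy as np

rng = np.random.default_rng(1)
loc, n = 5.0, 100_000

samples = {
    "gaussian": loc + rng.normal(0, 1, n),
    "laplacian": loc + rng.laplace(0, 1, n),
    "uniform": loc + rng.uniform(-1, 1, n),
}

for name, data in samples.items():
    mean     = data.mean()                      # minimizes L2 (sum of squares)
    median   = np.median(data)                  # minimizes L1 (sum of |dev|)
    midrange = (data.min() + data.max()) / 2.0  # minimizes L-infinity (max |dev|)
    print(name, mean, median, midrange)
```

All three recover the true location of 5 here, but each is most precise on its matching noise type; the midrange of the Gaussian sample, for instance, is visibly noisier than its mean.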

I’m guessing that it has to do with maximum likelihood estimation of their parameters (assuming that the uniform distribution is on [0, \theta]). The dispersion parameter of the normal distribution is \sigma^2, and its maximum likelihood estimator is the mean squared deviation. The dispersion parameter of the Laplace distribution is denoted as b, and its maximum likelihood estimator is the mean absolute deviation. The dispersion parameter of the uniform is \theta, and its maximum likelihood estimator is the maximum observation.
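Those three estimators are easy to check by simulation (a sketch, taking the location as known and zero, so the deviations are just the observations themselves):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

z = rng.normal(0, 2.0, n)    # normal with sigma = 2
print(np.mean(z**2))         # MLE of sigma^2: mean squared deviation, ~4.0

w = rng.laplace(0, 1.5, n)   # Laplace with b = 1.5
print(np.mean(np.abs(w)))    # MLE of b: mean absolute deviation, ~1.5

u = rng.uniform(0, 3.0, n)   # uniform on [0, theta] with theta = 3
print(u.max())               # MLE of theta: maximum observation, ~3.0
```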

It’s a nice way of looking at things, but it doesn’t really tell you how to assign measures of dispersion to other distributions. There is a general notion of a dispersion parameter for exponential families, but that still doesn’t cover something like a Pareto or a Cauchy.

On preview: Beaten like a dead horse. Oh well.

Chronos: I’ve been looking for the result that I mentioned, but I can’t find anything. It may be that it’s specific to exponential families, or that I’m just making it up, but I’ll keep digging.

I appreciated seeing your description anyway - it’s nice to see things expressed in slightly different ways. Leads to more robust understanding…

For a linear model where the errors have zero mean, equal variances and no correlation, the least squares estimate is the best linear unbiased estimator by the Gauss-Markov theorem.

Note that a linear model is linear in the parameters, not the variables. Anything of the form g(y) = \beta_0 + \beta_1 f_1(x) + \beta_2 f_2(x) + … + \beta_k f_k(x) is linear as long as the functions g, f_1, f_2, …, f_k are known.
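For instance (a sketch with made-up basis functions and coefficients), such a model is still fit by ordinary least squares: stack the known f_i(x) as columns of a design matrix and solve.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 2 * np.pi, 200)
y = 1.0 + 2.0 * np.sin(x) - 0.5 * x**2 + rng.normal(0, 0.1, x.size)

# Linear in the betas even though sin(x) and x^2 are not linear in x.
X = np.column_stack([np.ones_like(x), np.sin(x), x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)   # ~[1.0, 2.0, -0.5]
```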

Edit: Also, a variant of least squares (generalized least squares) is BLUE when the covariance matrix of the errors is known (or known up to a constant) but not diagonal.
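That variant works by whitening: factor the known covariance and transform the system so the errors become uncorrelated, then run ordinary least squares. A sketch, with an AR(1)-style covariance made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x])

idx = np.arange(n)
Sigma = 0.8 ** np.abs(idx[:, None] - idx[None, :])        # known error covariance
L = np.linalg.cholesky(Sigma)
y = X @ np.array([1.0, 2.0]) + L @ rng.normal(0, 0.3, n)  # correlated errors

# Whiten with L^{-1}: the transformed errors are uncorrelated, so plain
# least squares on (L^{-1} X, L^{-1} y) is the generalized least squares fit.
beta, *_ = np.linalg.lstsq(np.linalg.solve(L, X), np.linalg.solve(L, y), rcond=None)
print(beta)   # ~[1.0, 2.0]
```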

You’re right - I was oversimplifying. The fact that the L2 estimate gives good results (by its own standard) for a wide range of noise types is another reason why L2 is so useful.

This proportionality argument has no substance, though; for Gaussian distributions, the L1 deviation is also proportional to the dispersion parameter, as the L1 deviation is proportional to the L2 deviation [as sqrt(2/π)σ], and so on for the rest of it. Indeed, tautologically, any measure of dispersion will be proportional to any dispersion parameter in any distribution where it is well-defined, where “measure of dispersion” and “dispersion parameter” are both understood to be things which scale in the natural way as the distribution is stretched wider or thinner (and the family of distributions being considered only varies in this way).
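(A quick numeric check of that sqrt(2/π) factor, for what it’s worth:)

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 3.0
x = rng.normal(0, sigma, 1_000_000)

print(np.mean(np.abs(x)))           # ~2.394: the L1 deviation...
print(sigma * np.sqrt(2 / np.pi))   # ~2.394: ...is just sqrt(2/pi) * sigma
```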

The maximum likelihood estimation argument is more compelling.

Yep. I should have said what ultrafilter said in post 33.