As you note, we don’t need to use standard deviation (SD) just because it makes all the numbers positive during the summing; other things can do that, too.
There are a number of ways to think about why SD is sensible. A sprinkling:
(1) As @Thudlow_Boink mentions, you can translate this problem into one that asks “How far away is this data set, in distance, from a central point?” The case of only two data points can be drawn in 2D, which I will try here. (Aside: is there a way to do “code” formatting but not have it try to mark up things it thinks are keywords, like “and” and “1”? I just want a fixed-width font, not actual code formatting.)
* = some central point, like the mean but doesn't have to
be the mean, that we are measuring deviations from.
Here are two data sets (A and B) with 1 data point each:
--B-------------*---------A--------
deviation in data point 1
Which data set, in its entirety, is further from the star? Since there’s only one data point, it’s just the distance of that single data point from the star, so data set “B” is further than data set “A”.
Here are two data sets (A and B) with 2 data points each:
-----------------------------------
| |
| |
| |
| A |
| |
deviation | |
in data | B |
point 2 | * |
| |
| |
| |
| |
| |
| |
-----------------------------------
deviation in data point 1
Which data set is further from the star? Well, the one furthest from the star in this “deviation” space. But since there are two data points in a given data set, we have a 2D deviation space. And to figure out which data set (in full) is further away from the star, we need to do some Pythagorean theorem work. That is, the distance that data set A is from the star is:
sqrt [ (deviation of its data point 1)^2 + (deviation of its data point 2)^2 ]
This generalizes to any number of data points. How “far” the entire data set is from a reference point is the square root of the sum of the squares of the deviations for each data point. You don’t actually have to take the square root to see which is further, of course, but doing so gives you the same units as the data again, which is convenient (and can be cleanly thought of as a distance). Also, one usually divides the sum of squared deviations by the number of points before taking the root, to get the average squared deviation per point (since data sets will have different numbers of data points).
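To make point (1) concrete, here is a small Python sketch (the function names are my own, just for illustration):

```python
import math

def distance_from(data, ref):
    """Euclidean distance of a whole data set from a reference
    point, in 'deviation space': the square root of the sum of
    squared deviations of each data point from ref."""
    return math.sqrt(sum((x - ref) ** 2 for x in data))

def rms_deviation(data, ref):
    """Same idea, but divide by the number of points before the
    root, so data sets of different sizes are comparable."""
    n = len(data)
    return math.sqrt(sum((x - ref) ** 2 for x in data) / n)

A = [3.0, 4.0]                  # deviations from ref=0 are 3 and 4
print(distance_from(A, 0.0))    # 5.0, the classic 3-4-5 triangle
print(rms_deviation(A, 0.0))    # sqrt(25/2), about 3.536
```

The two data points act exactly like the two coordinates in the 2D picture above, and the Pythagorean theorem does the rest.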
(2) Note: the above quantity is the square root of the mean of the square deviations about some reference point, and this is shortened to the name “root mean square”, or just RMS.
(3) A different (but fundamentally related) way to look at the problem is to note the central limit theorem, which in very rough terms says that, for typical situations occurring naturally, the sort of randomness that leads to deviations in data often follows a normal distribution, a.k.a. a Gaussian distribution, for the deviations. So, you often get your prototypical bell curve.
Say you want to compare two data sets whose data follow a normal distribution to find out which data set is more “spread out”. A very sensible thing to compare is the width of their respective normal distributions, which is given by the Gaussian parameter sigma, which is also called its standard deviation, which is also equal to the RMS quantity above.
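You can check that “sigma equals the RMS deviation about the mean” numerically; here is a quick Python check using the standard library (statistics.pstdev is the population standard deviation):

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sum(data) / len(data)          # 5.0

# RMS of the deviations about the mean:
rms = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

# The (population) standard deviation is the same number:
print(rms, statistics.pstdev(data))   # 2.0 2.0
```

(Note that statistics.stdev, with an n-1 divisor, is the *sample* standard deviation and will differ slightly; that distinction is a separate discussion.)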
(4) The standard deviation (or its square, called the variance) has lots of nice properties that are a consequence of all of the above. Weighted averages behave as you would want them to if the weights are 1/(variance). Error propagation does what it’s supposed to in suitably linear systems when you use the standard deviation.
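As one example of point (4), here is a sketch of an inverse-variance weighted average in Python (the function name is my own): combining measurements with weights 1/sigma^2 gives the combined value, and the combined uncertainty is 1/sqrt(sum of weights).

```python
import math

def weighted_mean(values, sigmas):
    """Combine measurements with uncertainties sigma_i using
    weights w_i = 1/sigma_i^2 (inverse variance). Returns the
    combined value and its uncertainty, 1/sqrt(sum of weights)."""
    weights = [1.0 / s ** 2 for s in sigmas]
    mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    sigma = 1.0 / math.sqrt(sum(weights))
    return mean, sigma

# Two measurements of the same quantity, the second twice as precise:
m, s = weighted_mean([10.0, 13.0], [2.0, 1.0])
print(m)   # 12.4 -- pulled toward the more precise measurement
```

The more precise measurement dominates, exactly as you would want, and the combined uncertainty is smaller than either input’s.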