Standard deviation is one of the basic tools in a statistician’s tool chest for measuring variability within a data set. The basic formula can be found on Wikipedia. Within the formula, one of the summations takes every data point, subtracts the average value, and then squares the result. This blog post explores how this squaring affects the standard deviation.
Let’s use a simple integer example: 2, 4, 3, 7. The average value is (2+4+3+7) / 4 = 4. The sum of squared differences is (2-4)^2 + (4-4)^2 + (3-4)^2 + (7-4)^2 = 4 + 0 + 1 + 9 = 14. Dividing by 3 (one less than the number of values) gives 4.67, and the square root of that is 2.16. This data set shows that values at or near the average contribute far less to the sum than values further away from the average.
Let’s use another simple integer example: 1, 7, 8, 9, 15. The average value is (1+7+8+9+15) / 5 = 8. The sum of squared differences is (1-8)^2 + (7-8)^2 + (8-8)^2 + (9-8)^2 + (15-8)^2 = 49 + 1 + 0 + 1 + 49 = 100. Dividing by 4 (one less than the number of values) gives 25, and the square root of that is 5.0.
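The arithmetic for both integer examples can be checked with a few lines of Python. This is a minimal sketch that assumes the sample form of the formula (dividing by n - 1, as in the worked examples); the function name `sample_std` is made up for illustration.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation: sum of squared differences divided by n - 1."""
    mean = sum(values) / len(values)
    squared = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared) / (len(values) - 1))

first = [2, 4, 3, 7]
second = [1, 7, 8, 9, 15]

print(round(sample_std(first), 2))   # 2.16
print(round(sample_std(second), 2))  # 5.0

# Cross-check against the standard library's sample standard deviation
print(math.isclose(sample_std(second), statistics.stdev(second)))  # True
```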
Squaring a difference of 0 or 1 still results in a value of 0 or 1, as if the squaring didn’t happen. Differences bigger than 1 grow non-linearly, as the square of the difference. Thus, the bigger the difference from the average value, the quadratically bigger its influence on the summation! The squaring is the problematic part, as bigger differences get weighted more than smaller ones.
Another way to look at the core summation within the standard deviation is by thinking of it as a weighted sum, with each difference value also being its own weight.
w0 * d0 + w1 * d1 + w2 * d2 + … + wN * dN
In the above equation, each weight (w*) is equal to the value of the corresponding difference (d*). Thus, bigger differences are weighted more than smaller differences. This type of equation is biased heavily toward the larger differences.
For example, a single difference of 100 contributes 10,000 to the summation, whereas one hundred differences of 1 contribute only 100 in total. Thus, a single difference of 100 contributes 10,000 times more than a single difference of 1. This single large difference drowns out many smaller differences.
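The weighted-sum view is easy to try out. In this small sketch the helper name `weighted_sum` is invented for illustration; it uses each difference as its own weight, which is exactly what squaring does.

```python
def weighted_sum(differences):
    # Each difference acts as its own weight, so w * d is the same as d squared.
    weights = differences
    return sum(w * d for w, d in zip(weights, differences))

single_large = [100]    # one difference of 100
many_small = [1] * 100  # one hundred differences of 1

print(weighted_sum(single_large))  # 10000
print(weighted_sum(many_small))    # 100
```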
In summary, for integer values, differences of 0 and 1 are not affected by squaring, whereas differences of 2 or more are weighted non-linearly higher, biasing the sum toward larger differences. Thus, the standard deviation computation is biased by larger differences.
For floating-point data sets, the non-linear weighting of values gets more severe. These data sets will have differences not only of 0 and 1, but also in between 0 and 1. Those differences get smaller when squared, de-emphasizing their contribution, while differences greater than 1 still get over-emphasized.
For example, a difference of 100, when squared, results in 10,000. A difference of 0.01, when squared, results in 0.0001. The ratio between these two is 100 million (eight orders of magnitude), with a single difference of 100 dwarfing potentially millions of small differences. Thus, differences larger than 1 get larger and differences smaller than 1 get smaller, non-linearly (quadratically), with values above 1 emphasized and values below 1 de-emphasized. Values of 0 and 1 are still the only values that are unaffected by squaring.
Let’s use the following data set: 1.0, 3.99, 4.0, 4.01, 7.0. The average value is (1.0+3.99+4.0+4.01+7.0) / 5 = 4. The sum of squared differences is (1-4)^2 + (3.99-4)^2 + (4-4)^2 + (4.01-4)^2 + (7-4)^2 = 9 + 0.0001 + 0 + 0.0001 + 9 = 18.0002. Dividing by 4 gives approximately 4.5, and the square root of that is 2.12.
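To see how lopsided the contributions are in this data set, the sketch below prints each value’s squared difference and its share of the total. The variable names are illustrative, and the printed digits may show tiny floating-point rounding.

```python
values = [1.0, 3.99, 4.0, 4.01, 7.0]
mean = sum(values) / len(values)

# Squared difference from the mean for every data point
squared = [(v - mean) ** 2 for v in values]
total = sum(squared)

for v, s in zip(values, squared):
    print(f"{v:5.2f}  squared difference {s:.4f}  share of total {s / total:.4%}")
```

The two outer values (1.0 and 7.0) account for nearly the entire sum, while the three values at or near the average contribute almost nothing.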
Squaring is used in the standard deviation computation for two reasons:
– as an absolute value, to turn all differences positive
– to provide the ability to find minima/maxima by taking the first derivative
These are powerful reasons, which provide powerful tools when they are needed. However, we need to understand what we are doing when we square and the side-effects that occur, and truly understand the drawbacks when using this technique.
What other ways are available that don’t warp the data set? One such measurement is the mean absolute deviation, which uses the absolute value instead of squaring and has no square root. This measurement gives up the ability to find minima/maxima, but returns a more meaningful result that is not biased toward larger differences and not biased away from differences smaller than 1. Each difference has a weight of 1 in the summation, so all weights are equal. No difference is emphasized over another, no matter what its value is.
For the integer data set 1, 7, 8, 9, 15 above: |1-8| + |7-8| + |8-8| + |9-8| + |15-8| = 7 + 1 + 0 + 1 + 7 = 16, divided by 5 (the number of values) is 3.2.
For the floating-point data set 1.0, 3.99, 4.0, 4.01, 7.0 above: |1-4| + |3.99-4| + |4-4| + |4.01-4| + |7-4| = 3 + 0.01 + 0 + 0.01 + 3 = 6.02, divided by 5 is 1.2.
The ratios between the mean absolute deviation and the standard deviation for these two data sets are 3.2 / 5.0 = 0.64 (the standard deviation of the integer set being the square root of 100 / 4) and 1.2 / 2.12 = 0.57. For other data sets the gap could be much more significant. It’s interesting that the mean absolute deviation is smaller. Is this always true? It is, for integer and floating-point sets alike: the mean of the absolute differences can never exceed the square root of the mean of the squared differences. Differences between 0 and 1 do not change this, because rescaling the data scales both measures by the same factor.
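The comparison can be reproduced with the following sketch. It assumes the usual definition of the mean absolute deviation (dividing by the number of values, n) and the sample standard deviation (dividing by n - 1); both function names are chosen here for illustration.

```python
import math

def mean_absolute_deviation(values):
    # Assumption: divide by n, the usual definition of a mean.
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

def sample_standard_deviation(values):
    # Sample form: divide the sum of squared differences by n - 1.
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

for data in ([1, 7, 8, 9, 15], [1.0, 3.99, 4.0, 4.01, 7.0]):
    mad = mean_absolute_deviation(data)
    std = sample_standard_deviation(data)
    print(f"MAD {mad:.2f}  SD {std:.2f}  ratio {mad / std:.2f}")
```

Swapping in other data sets is a quick way to see how far the ratio can drop when a few large differences dominate.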
Wikipedia states that for the normal distribution the mean absolute deviation is about 0.8 times the standard deviation. What about other distributions, or any other data set? Finding a closed-form solution seems difficult, as the absolute value is awkward to handle analytically: it is continuous, but it is not differentiable at zero.
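The factor for the normal distribution is the square root of 2/π, roughly 0.798, and it can be checked empirically. The sketch below is only a simulation under assumed settings (100,000 standard-normal samples and an arbitrary fixed seed), not a derivation.

```python
import math
import random

rng = random.Random(42)  # fixed seed, an arbitrary choice for repeatability
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

mean = sum(samples) / len(samples)
mad = sum(abs(x - mean) for x in samples) / len(samples)
std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))

print(mad / std)               # simulated ratio, close to the value below
print(math.sqrt(2 / math.pi))  # theoretical ratio, about 0.798
```

Replacing `rng.gauss` with another generator, such as `rng.uniform` or `rng.expovariate`, gives a feel for how the ratio changes with the distribution.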
Problem: Find data sets with maximal percentage difference between the standard deviation and mean absolute deviation. What are the attributes of these data sets? Why do they produce such large differences?