Standard deviation is a core statistical algorithm for measuring the variability of a data set, and it is used extensively in data science to characterize data. The computation itself uses summation twice: once to compute the mean (average) of the data set, and again to sum the squared differences between each data item and that mean (see the equation on Wikipedia).
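For reference, the population form of that equation is:

\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i

where the mean \mu is the first summation, and the squared differences under the square root form the second.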
Summation of arrays is something that the HPCsharp library in C# (free and open source) implements very well. It provides parallel versions of Array.Sum() for all C# numeric data types. These implementations use data-parallel SIMD/SSE instructions on each processor core, and also use all of the cores.
HPCsharp summation implementations are also safer to use than the standard C# Array.Sum(), because they avoid numeric overflow exceptions internally while producing perfectly accurate integer results, at full speed.
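As a usage sketch (the SumSsePar() extension-method name is taken from the HPCsharp README, and is an assumption to verify against the version you install):

using HPCsharp;   // NuGet package: HPCsharp

int[] values = { 5, 7, 16, 3 };
// SIMD/SSE within each core, plus all of the cores; accumulates internally
// in a wider type, so it cannot throw an arithmetic overflow exception.
// Method name per the HPCsharp README — verify against the installed version.
var sum = values.SumSsePar();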
Performance
The following benchmarks ran on a six-core Dell G5 laptop with an i7-9750H processor.
Standard deviation of integer arrays:
– 0.5 GigaIntegers/sec for Linq-style multi-core implementation
– 3.4 GigaIntegers/sec for SSE, multi-core HPCsharp implementation
Standard deviation of long arrays:
– 0.48 GigaLongs/sec for Linq-style multi-core implementation
– 1.8 GigaLongs/sec for SSE, multi-core HPCsharp implementation
Standard deviation of ulong arrays:
– 2.0 GigaUlongs/sec for SSE, multi-core HPCsharp implementation
Standard deviation of float arrays:
– 0.49 GigaFloats/sec for Linq-style multi-core implementation
– 4.0 GigaFloats/sec for SSE, multi-core HPCsharp implementation
Standard deviation of double arrays:
– 0.49 GigaDoubles/sec for Linq-style multi-core implementation
– 2.1 GigaDoubles/sec for SSE, multi-core HPCsharp implementation
Standard C# does not support .Sum() for ulong[], which is why there is no Linq-style comparison for standard deviation of ulong[] arrays.
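For reference, the Linq-style multi-core baseline benchmarked above is along these lines (a sketch — the exact benchmark code may differ, and StandardDeviationPar is an illustrative name):

using System;
using System.Linq;

public static class Stats
{
    // Linq-style multi-core standard deviation (population form):
    // parallel mean, then parallel average of the squared differences.
    public static double StandardDeviationPar(this int[] values)
    {
        double mean = values.AsParallel().Average();
        return Math.Sqrt(values.AsParallel().Average(v => (v - mean) * (v - mean)));
    }
}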
Arithmetic Overflow
The above performance comparisons for integer and long arrays are fair if we know that the values within the array are small enough for .Sum() to never cause arithmetic overflow. However, if the values are large enough to cause arithmetic overflow, then before performing the .Sum() we need to cast to a bigger data type, one big enough to never overflow. For instance, for integer array .Sum(), we would first cast to long, as in:
myArray.Sum(v => (long)v);
myArray.Average(v => (long)v);
For long[] or ulong[] arrays, we would first cast to decimal, whose 96-bit mantissa provides enough precision:
myArray.Sum(v => (decimal)v);
myArray.Average(v => (decimal)v);
The above versions support any value within these arrays: in the integer array, every value can be Int32.MaxValue; in the long array, every value can be Int64.MaxValue. With these implementations we are safe from arithmetic overflow exceptions, and the summation or average result is correct.
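Putting this together, an overflow-safe Linq-style standard deviation for int[] might look like the following sketch (illustrative, not the exact benchmark code — note the cast to long inside the mean computation):

using System;
using System.Linq;

public static class StatsSafe
{
    // Overflow-safe Linq-style standard deviation: casting each element to
    // long keeps the mean computation from overflowing even if every value
    // is Int32.MaxValue; the squared differences accumulate in double.
    public static double StandardDeviationSafePar(this int[] values)
    {
        double mean = values.AsParallel().Average(v => (long)v);
        return Math.Sqrt(values.AsParallel().Average(v => (v - mean) * (v - mean)));
    }
}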
HPCsharp .Sum() implementations do this internally, even when using SSE data-parallel instructions, and the HPCsharp standard deviation implementations use these safer .Sum() versions. The benchmarks above are for these safe implementations, which will not cause arithmetic overflow. Let's see how much slower the safe versions of the standard C# standard deviation implementations are:
Standard deviation of integer arrays, with cast to long to avoid arithmetic overflow:
– 0.34 GigaIntegers/sec for Linq-style multi-core implementation
– 3.4 GigaIntegers/sec for SSE, multi-core HPCsharp implementation
Standard deviation of long arrays, with cast to decimal to avoid arithmetic overflow:
– 0.25 GigaLongs/sec for Linq-style multi-core implementation
– 1.8 GigaLongs/sec for SSE, multi-core HPCsharp implementation
Standard deviation of ulong arrays, with cast to decimal to avoid arithmetic overflow:
– 0.22 GigaUlongs/sec for Linq-style multi-core implementation
– 2.0 GigaUlongs/sec for SSE, multi-core HPCsharp implementation
Converting to a larger data type during the mean computation slowed the standard C# versions down by nearly 2X. In HPCsharp, avoidance of arithmetic overflow exceptions is built in, with no loss of performance.
More Accurate Floating-Point Computation
HPCsharp provides two options when computing standard deviation of float[] arrays:
– use float throughout the entire computation, for a faster but less accurate result
– convert each value to double, for more accurate summation and computation
These two options trade performance for accuracy. Both use SSE instructions within each core, as well as all of the cores, for additional performance through parallel computing.
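In Linq-style terms, the two options look like this (illustrative only — HPCsharp's SSE implementations are internal to the library):

using System;
using System.Linq;

float[] a = { 1.5f, 2.5f, 3.5f };

// Option 1: float throughout — faster, but the accumulations lose precision.
float meanF = a.AsParallel().Average();
double stdDevF = Math.Sqrt(a.AsParallel().Average(v => (v - meanF) * (v - meanF)));

// Option 2: promote each element to double — more accurate accumulation.
double meanD = a.AsParallel().Average(v => (double)v);
double stdDevD = Math.Sqrt(a.AsParallel().Average(v => ((double)v - meanD) * ((double)v - meanD)));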
More Benchmarks
The following benchmarks ran on a 14-core Xeon W-2175 processor with four channels of memory.
Standard deviation of integer arrays:
– 0.44 GigaIntegers/sec for Linq-style multi-core implementation
– 4.9 GigaIntegers/sec for SSE, multi-core HPCsharp implementation
Standard deviation of long arrays:
– 0.29 GigaLongs/sec for Linq-style multi-core implementation
– 2.2 GigaLongs/sec for SSE, multi-core HPCsharp implementation
Standard deviation of ulong arrays:
– 3.6 GigaUlongs/sec for SSE, multi-core HPCsharp implementation
Standard deviation of float arrays:
– 0.58 GigaFloats/sec for Linq-style multi-core implementation
– 6.5 GigaFloats/sec for SSE, multi-core HPCsharp implementation
Standard deviation of float arrays, with all computing in doubles:
– 0.47 GigaFloats/sec for Linq-style multi-core implementation
– 5.9 GigaFloats/sec for SSE, multi-core HPCsharp implementation
Standard deviation of double arrays:
– 0.53 GigaDoubles/sec for Linq-style multi-core implementation
– 3.7 GigaDoubles/sec for SSE, multi-core HPCsharp implementation
On this computer, most of the HPCsharp standard deviation implementations are about 10X faster than the Linq-style implementations in standard C#.
Future Plans
More floating-point accuracy will be coming soon, by using Kahan floating-point summation, which .Sum() in HPCsharp already implements. Kahan summation offers single-epsilon computational error for float[] and double[] summation, with only a slight performance degradation, while still using the full power of SSE instructions and multiple cores.
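For reference, the scalar form of Kahan (compensated) summation is shown below; HPCsharp applies the same idea with SSE instructions and multiple cores.

using System;

public static class KahanSummation
{
    // Scalar Kahan (compensated) summation — a textbook sketch, not
    // HPCsharp's SIMD/multi-core implementation.
    public static double SumKahan(double[] values)
    {
        double sum = 0.0;
        double compensation = 0.0;                     // running round-off error
        foreach (double v in values)
        {
            double corrected = v - compensation;       // apply the error from the prior step
            double newSum = sum + corrected;           // low-order bits of corrected may be lost here
            compensation = (newSum - sum) - corrected; // recover the lost bits
            sum = newSum;
        }
        return sum;
    }
}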