Parallel Standard Deviation

Standard deviation is a core statistical algorithm used to measure variability of a data set. It is used in data science extensively, to provide useful information about the data. The computation itself uses summation twice within the algorithm: once to compute the mean (average) of the data set, and another to sum the square of the difference between each data item and the mean (see the equation on Wikipedia).

Summation of arrays is something that the HPCsharp library in C# (free and open source) implements very well. It provides parallel versions of Array.Sum() for all C# numeric data types. These implementations use data-parallel SIMD/SSE instructions on each processor core, and also use all of the cores.

HPCsharp summation implementations are also safer to use than the standard C# Array.Sum(), because they avoid numeric overflow exceptions internally, while providing perfectly accurate results integer results, at full speed.

Performance

The following benchmarks ran on a six-core Dell G5 laptop with an i7-9750H processor.

Standard deviation of integer arrays:
– 0.5 GigaIntegers/sec for Linq-style multi-core implementation
– 3.4 GigaIntegers/sec for SSE, multi-core HPCsharp implementation

Standard deviation of long arrays:
– 0.48 GigaLongs/sec for Linq-style multi-core implementation
– 1.8 GigaLongs/sec for SSE, multi-core HPCsharp implementation

Standard deviation of ulong arrays:
– 2.0 GigaUlongs/sec for SSE, multi-core HPCsharp implementation

Standard deviation of float arrays:
– 0.49 GigaFloats/sec for Linq-style multi-core implementation
– 4.0 GigaFloats/sec for SSE, multi-core HPCsharp implementation

Standard deviation of double arrays:
– 0.49 GigaDoubles/sec for Linq-style multi-core implementation
– 2.1 GigaDoubles/sec for SSE, multi-core HPCsharp implementation

Standard C# does not support .Sum() for ulong[], and is the reason there is no comparison for standard deviation of ulong[] arrays.

Arithmetic Overflow

The above performance comparisons for integer and long arrays is fair if we know that the values within that array are small enough for .Sum() to never cause arithmetic overflow. However, if the values within the array are large enough to cause arithmetic overflow, then before performing the .Sum() we need to cast to a data type that is bigger, and is big enough to never overflow. For instance, for integer array .Sum(), we would first cast to long, as in:

myArray.Sum( v => (long)v );
myArray.Average( v => (long)v );

For long[] or ulong[] arrays, we would first cast to Decimal, which has 96-bits of accuracy.

myArray.Sum( v => (Decimal)v );
myArray.Average( v => (Decimal)v );

The above versions, support any value within these arrays. For instance, in the integer array, every value can be Int32.MaxValue. For long array, every value can be Int64.MaxValue. Thus, we are safe with these implementations to never cause arithmetic overflow exception and to produce correct summation or average computation result.

HPCsharp .Sum() implementations do this internally, even when using SSE data parallel instructions. HPCsharp uses these safer .Sum() versions in the standard deviation implementations. The benchmarks above is for these safe implementations, which will not cause arithmetic overflow. Let’s see how much slower the safe versions of standard C# implementations of standard deviation:

Standard deviation of integer arrays, with cast to long to avoid arithmetic overflow:
– 0.34 GigaIntegers/sec for Linq-style multi-core implementation
– 3.4 GigaIntegers/sec for SSE, multi-core HPCsharp implementation

Standard deviation of long arrays, with cast to decimal to avoid arithmetic overflow:
– 0.25 GigaLongs/sec for Linq-style multi-core implementation
– 1.8 GigaLongs/sec for SSE, multi-core HPCsharp implementation

Standard deviation of ulong arrays, with cast to decimal to avoid arithmetic overflow:
– 0.22 GigaUlongs/sec for Linq-style multi-core implementation
– 2.0 GigaUlongs/sec for SSE, multi-core HPCsharp implementation

Conversion to a larger data type during the average computation slowed down performance by nearly 2X when using standard C#. HPCsharp implementation avoidance of arithmetic overflow exception is built-in.

More Accurate Floating-Point Computation

HPCsharp provides two options when computing standard deviation of float[] arrays:
– use float throughout the entire computation, for faster but less accurate result
– convert to doubles for more accurate summation and computations

These two options provide a trade-off of performance for accuracy, and both use SSE instructions within each core, as well as multi-core, for additional performance thru parallel computing.

More Benchmarks

The following benchmarks ran on a 14-core Xeon W-2175 processor with four channels of memory.

Standard deviation of integer arrays:
– 0.44 GigaIntegers/sec for Linq-style multi-core implementation
– 4.9 GigaIntegers/sec for SSE, multi-core HPCsharp implementation

Standard deviation of long arrays:
– 0.29 GigaLongs/sec for Linq-style multi-core implementation
– 2.2 GigaLongs/sec for SSE, multi-core HPCsharp implementation

Standard deviation of ulong arrays:
– 3.6 GigaUlongs/sec for SSE, multi-core HPCsharp implementation

Standard deviation of float arrays:
– 0.58 GigaFloats/sec for Linq-style multi-core implementation
– 6.5 GigaFloats/sec for SSE, multi-core HPCsharp implementation

Standard deviation of float arrays, with all computing in doubles:
– 0.47 GigaFloats/sec for Linq-style multi-core implementation
– 5.9 GigaFloats/sec for SSE, multi-core HPCsharp implementation

Standard deviation of double arrays:
– 0.53 GigaDoubles/sec for Linq-style multi-core implementation
– 3.7 GigaDoubles/sec for SSE, multi-core HPCsharp implementation

On this computer, most of the HPCsharp standard deviation algorithms are about 10X faster than ones implemented using standard C# in Linq-style.

Future Plans

More floating-point accuracy will be coming soon, by using Kahan floating-point summation, as .Sum() in HPCsharp currently implements. Kahan summation offers single epsilon computational error, for float[] array and double[] array summation, with only a slight performance degradation, while still using the full power of SSE instructions and multi-core.

 

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s