C# Linq provides a convenient way to add up all of the array elements.

var arr = new int[] { 5, 7, 16, 3 }; int arrSum = arr.Sum();

This .Sum() supports some of the built-in numeric types: int, long, float, double, and decimal. It provides a nicely consistent and convenient way of adding up all of the values within an array.

For each data type, C# produces a sum of the same data type. For instance, when adding up an array of int values, the sum is also an int value, as seen in the code above. The .Sum() throws an overflow exception whenever adding all of the array elements exceeds the maximum value of that numeric data type – Int32.MaxValue in the above example.

## The Overflow Problem

For int[] this overflow exception occurs much more often than expected. For instance, when adding up random int values that are over a full positive range of int, overflow happens quite often. Overflow can even happen with an array of just two elements:

var arr = new int[] { Int32.MaxValue, 1 }; int arrSum = arr.Sum();

This behavior makes integer array .Sum() problematic and difficult to use. Dealing with overflow exceptions complicates this simple summation.

## Eliminating Overflow

It’s possible to eliminate overflow for .Sum() of integer arrays. By using a larger data type for the accumulator during the .Sum() all of the bits can be preserved and returned to the caller. For instance, a long, which is 64-bits in size can be used as the accumulator and be returned.

In C#, array index is limited to an int data type, which is 32-bits. Thus, when accumulating an array of int’s, we would need 32-bits plus 32-bit – i.e. 64-bits total – in the accumulator, to avoid overflow. A long, which is 64-bits, provides just enough bits to avoid overflow entirely.

The .Sum() implementation in the HPCsharp nuget package uses a long integer data type as an internal accumulator and returns a long integer summation result. This method can never throw an overflow exception.

## Supporting Unsigned and Other Numeric Types

C# Linq .Sum() supports only a limited set of signed integer data types: int and long. Other C# signed integer data types, such as sbyte and short, along with unsigned integer data types, such as byte, ushort, uint, and ulong, are not supported.

HPCsharp adds implementation of .Sum() for all of these integer data types: signed and unsigned. For arrays of unsigned integers, such as byte, ushort, uint and ulong, .Sum() returns a ulong, which is the largest unsigned integer data type available. For signed integer data types of sbyte and short, HPCsharp .Sum() returns a long, which is the largest signed integer data type.

For 8-bit, 16-bit and 32-bit integer data types, HPCsharp avoids overflow exception by design. However, Linq .Sum() of long arrays, and HPCsharp .Sum() of long and ulong arrays, can still throw an overflow exception, because there is no larger integer data type available.

## Eliminating Overflow Exception for Long and ULong

Long and Ulong are 64-bit and are the largest integer data types. When using a long accumulator during .Sum() overflow exception is possible and likely in the case of accumulating random 64-bit values. If a double, which has 53 bits of precision, is used as an accumulator, then 11 bits could potentially be lost, making double less than an ideal choice.

Ah, but C# has another numeric data type, which has 96 bits of precision – decimal. Decimal provides just enough bits for accumulating long or ulong integer arrays. Since long and ulong are 64 bits and C# indexes are 32-bits, an accumulator would need 96 bits of precision to avoid overflow. The decimal data type provides exactly 96 bit of precision, making it a wonderful choice, giving C# an advantage over other languages.

While accuracy of .Sum() for long and ulong, when using a decimal accumulator, will be perfect, the performance will be much slower, since modern processors do not support the decimal data type in hardware. All of the arithmetic operations on the decimal data type are implemented in software.

HPCsharp provides these more accurate .Sum() functions for long and ulong integer arrays. Performance is 25% faster than Linq .Sum() of decimal arrays and uses half the memory. Single-core and multi-core versions are available.

## Data Parallel on Single Core

Adding integers and floating-point numbers is a simple operation supported by SIMD/SSE data parallel instructions on most modern processors, such as Intel, AMD and ARM. Some processors support up to 512-bits wide in parallel, which means sixteen 32-bit additions in parallel in a single CPU clock cycle, while running at 3 to 4 GHz.

HPCsharp takes advantage of these SIMD/SSE data parallel instructions to speed up .Sum() for all of the numeric data types, except decimal, to provide even faster performance. On a single core this leads to more than 20X performance boost over Linq .Sum() for integer arrays. Similar speedups are achieved for other data types.

Lib |
sbyte |
byte |
short |
ushort |
int |
uint |
long |
ulong |
float |
double |
decimal |
---|---|---|---|---|---|---|---|---|---|---|---|

Linq | n/a | n/a | n/a | n/a | 0.15* | n/a | 0.16* | n/a | 0.17 | 0.17 | 0.03 |

HPCsharp | 2.8 | 3.0 | 3.2 | 3.2 | 3.3 | 3.3 | 2.0* | 2.0* | 3.7 | 1.9 | 0.14 |

## Parallel on Multiple Cores

Modern processors are finally adding more cores beyond 4. Recently, AMD has released a desktop processor with 32 cores, with hyperthreading. Intel has similar ones in the works. By using multiple cores, it’s possible to accelerate .Sum() further. On my quad-core hyperthreaded laptop, this provides 6X speedup over Linq .Sum().AsParallel(), providing beyond 5 GigaInt/sec summation performance, as seen in the following table:

Lib |
sbyte |
byte |
short |
ushort |
int |
uint |
long |
ulong |
float |
double |
decimal |
---|---|---|---|---|---|---|---|---|---|---|---|

Linq | n/a | n/a | n/a | n/a | 0.9* | n/a | 0.9* | n/a | 0.9 | 0.9 | 0.12 |

HPCsharp | 7.6 | 8.0 | 8.0 | 8.2 | 5.0 | 5.3 | 2.9* | 2.8* | 5.1 | 2.9 | 0.14 |

The above table show performance of Linq and HPCsharp .Sum() for arrays of various numeric data types in Billions/second, running on quad-core and using SIMD/SSE instructions. The ones marked with * can throw an overflow exception, and n/a means not available.

## More Accurate Floating-Point .Sum()

To provide more accuracy for .Sum() for float arrays, a double could be used as an accumulator. This more accurate .Sum() version is implemented in HPCsharp. However, for doubles, no higher accuracy data type is available.

Floating-point has unique issues due to the large dynamic range supported. For instance, when adding a very large number with a very small number, the result may be surprising, where the sum ignores the small number entirely.

double[] arrDouble = new double[] { 1, 10.0e100, 1, -10e100 }; double arrSum = arrDouble.Sum();

The .Sum() in the above code using Linq .Sum() returns a zero, because the two large values overwhelm the two 1.0 values, ignoring these small values entirely.

Two algorithms have been developed to deal with this situation, and to provide a more accurate result: Kahan and its Neumaier variation. These algorithms go thru additional steps to avoid losing precision of the summation result when adding floating-point numbers. In the example above, the Neumaier sum algorithm produces the correct result of 2.

HPCsharp implements both of these more accurate algorithms for floating-point summation, for both float and double data types. Raising performance higher, Neumaier .Sum() is also implemented using SIMD/SSE instructions as well as multi-core.

Array |
Accumulator |
Scalar |
SIMD/SSE |
Scalar Multi-Core |
SIMD/SSE Multi-Core |

float[] | float | 0.6 | 2.7 | 1.9 | 4.8 |

float[] | double | 0.6 | 1.4 | 1.7 | 3.9 |

double[] | double | 0.6 | 1.3 | 1.6 | 2.5 |

The table above compares Neumaier .Sum() algorithm performance for scalar, SIMD/SSE single-core, with multi-core and SIMD/SSE/multi-core, in Giga/sec. It shows that using a double accumulator slows performance. It also shows that Neumaier more accurate .Sum() algorithm once implemented with SIMD/SSE and multi-core is only 15% slower than the standard lower accuracy algorithm – 4.8 GigaFloats/sec versus 5.1 GigaFloats/sec, and 2.5 GigaDoubles/sec versus 2.9 GigaDoubles/sec.

## Performance vs Accuracy Trade-off

HPCsharp also implements pairwise floating-point to provide higher accuracy than the naive for loop approach to summation, but not as accurate as the Kahan/Neumaier summation. For more details see the next blog (coming soon)…

## Better .Sum() in How Many Ways?

- HPCsharp speeds up .Sum() by using SIMD/SSE instruction capabilities within each core of the processor.
- HPCsharp .Sum() expands support to unsigned integers and smaller signed integers.
- HPCsharp avoids throwing overflow exception by using a 64-bit accumulator for all integer data types, and decimal when accumulating long[] or ulong[].
- HPCsharp provides Neumaier more accurate floating point summation algorithm, and implements it in SIMD/SSE on a single core, as well as multi-core.

## Where to Find These .Sum() Improvements

HPCsharp is a free nuget package, implementing all of the above .Sum() algorithms and can be downloaded from http://www.nuget.org

## What is not Supported, Yet

Linq .Sum() supports nullable arrays for int, long, float, double and decimal. HPCsharp does not support nullable arrays, yet. Let me know if you need these implemented and accelerated. In the mean time use Linq’s .Sum() in this case, and don’t forget to use .Sum().AsParallel() to use all of the processor cores, instead of just one.

Linq .Sum() supports adding up a field within a user defined type/class. HPCsharp does not support this capability, yet, but it shouldn’t be difficult to add.