C# Linq provides a convenient way to add up all of the array elements.

var arr = new int[] { 5, 7, 16, 3 }; int arrSum = arr.Sum();

This *.Sum()* supports some of the built-in numeric types: *int*, *long*, *float*, *double*, and *decimal*. It provides a nicely consistent and convenient way of adding up all of the values within an array.

For each data type, C# *.Sum()* produces result of the same data type. For instance, when adding up an array of *int* values, the sum is also an *int* value, as seen in the code above. The *.Sum()* throws an overflow exception whenever adding all of the array elements exceeds the maximum value of that numeric data type – *Int32.MaxValue* in the above example.

### The Overflow Problem

For *int[]* this overflow exception occurs much more often than expected. For instance, when adding up random int values that are over a full positive range of *int*, overflow happens quite often. Overflow can even happen with an array of just two elements:

var arr = new int[] { Int32.MaxValue, 1 }; int arrSum = arr.Sum();

This issue makes integer array *.Sum()* problematic and difficult to use. Dealing with overflow exceptions complicates this simple task of adding some numbers.

### Eliminating Overflow

It’s possible to eliminate overflow for *.Sum()* of integer arrays. By using a larger data type for the accumulator during the *.Sum()* all of the bits can be preserved. For instance, a *long*, which is 64-bits in size can be used as the accumulator and be returned as the summation result.

In C#, array index is limited to an *int* data type, which is 32-bits. Thus, when accumulating an array of *int’s*, we would need 32-bits plus 32-bit – i.e. 64-bits total – in the accumulator, to avoid overflow. A long, which is 64-bits, provides just enough bits to avoid overflow entirely.

The *.Sum()* implementation in the HPCsharp nuget package uses a *long* integer data type as an internal accumulator and returns a *long* integer summation result. This method can never throw an overflow exception.

### Supporting Unsigned and Other Numeric Types

C# Linq *.Sum()* supports only a limited set of signed integer data types: *int* and *long*. Other C# signed integer data types, such as *sbyte* and *short*, along with unsigned integer data types, such as *byte*, *ushort*, *uint*, and *ulong*, are not supported.

HPCsharp adds implementation of *.Sum()* for all of these integer data types: signed and unsigned. For arrays of unsigned integers, such as *byte*, *ushort*, *uint* and *ulong*, *.Sum()* returns a *ulong*, which is the largest unsigned integer data type available. For signed integer data types of *sbyte* and *short*, HPCsharp *.Sum()* returns a *long*, which is the largest signed integer data type.

For 8-bit, 16-bit and 32-bit integer data types, HPCsharp avoids overflow exception by design. However, Linq *.Sum()* of long arrays, and HPCsharp *.Sum()* of long and ulong arrays, can still throw an overflow exception, because there is no larger integer data type available.

### Eliminating Overflow Exception for Long and ULong

*Long* and *Ulong* are 64-bit and are the largest integer data types. When using a *long* accumulator during *.Sum()* overflow exception is possible and likely in the case of accumulating random 64-bit values. If a *double*, which has 53 bits of precision, is used as an accumulator, then 11 bits could potentially be lost, making *double* less than an ideal choice.

Ah, but C# has another numeric data type, which has 96 bits of precision – *decimal*. *Decimal* provides just enough bits for accumulating *long* or *ulong* integer arrays. Since *long* and *ulong* are 64 bits and C# indexes are 32-bits, an accumulator would need 96 bits of precision to avoid overflow. The *decimal* data type provides exactly 96 bit of precision, making it a wonderful choice, giving C# an advantage over other languages.

While accuracy of *.Sum()* for *long* and *ulong*, when using a *decimal* accumulator, will be perfect, the performance will be much slower, since modern processors do not support the decimal data type in hardware. All of the arithmetic operations on the *decimal* data type are implemented in software.

HPCsharp provides these more accurate *.Sum()* functions for *long* and *ulong* integer arrays. Performance is 25% faster than Linq *.Sum()* of *decimal* arrays and uses half the memory. Single-core and multi-core versions are available.

### Data Parallel on Single Core

Adding integers and floating-point numbers is a simple operation supported by SIMD/SSE data parallel instructions on most modern processors, such as Intel, AMD and ARM. Some processors support up to 512-bits wide in parallel, which means sixteen 32-bit additions in parallel in a single CPU clock cycle, while running at 3 to 4 GHz.

HPCsharp takes advantage of these SIMD/SSE data parallel instructions to speed up *.Sum()* for all of the numeric data types, except *decimal*, to provide even faster performance. On a single core this leads to more than 20X performance boost over Linq *.Sum()* for integer arrays. Similar speedups are achieved for other data types.

The following table shows summation in GigaElements/second:

Lib |
sbyte |
byte |
short |
ushort |
int |
uint |
long |
ulong |
float |
double |
decimal |
---|---|---|---|---|---|---|---|---|---|---|---|

Linq | n/a | n/a | n/a | n/a | 0.15* | n/a | 0.16* | n/a | 0.17 | 0.17 | 0.03 |

HPCsharp | 2.8 | 3.0 | 3.2 | 3.2 | 3.3 | 3.3 | 2.0* | 2.0* | 3.7 | 1.9 | 0.14 |

### Parallel on Multiple Cores

Modern processors are finally adding more cores beyond 4. Recently, AMD has released a desktop processor with 32 cores, with hyperthreading. Intel has similar ones in the works. By using multiple cores, it’s possible to accelerate .Sum() further. On my quad-core hyperthreaded laptop, this provides 6X speedup over Linq *.Sum().AsParallel()*, providing beyond 5 GigaInt/sec summation performance, as seen in the following table:

Lib | sbyte | byte | short | ushort | int | uint | long | ulong | float | double | decimal | BigInteger |

Linq | n/a | n/a | n/a | n/a | 0.9* | n/a | 0.9* | n/a | 0.9 | 0.9 | 0.12 | 0.11** |

HPC# | 7.6 | 8 | 8 | 8.2 | 5 | 5.3 | 2.9* | 2.8* | 5.1 | 2.9 | 0.14 | 0.036 |

The above table show performance of Linq and HPCsharp .Sum() for arrays of various numeric data types in Billions/second, running on quad-core and using SIMD/SSE instructions. The ones marked with * can throw an overflow exception, and n/a means not available. For ** Linq doesn’t implement BigInteger.Sum(), used .Aggregate() instead, which doesn’t speed-up with .AsParallel() – i.e. doesn’t seem to support multi-core.

### More Accurate Floating-Point *.Sum()*

To provide more accuracy for *.Sum()* for float arrays, a double could be used as an accumulator. This more accurate *.Sum()* version is implemented in HPCsharp. However, for doubles, no higher accuracy data type is available.

Floating-point has unique issues due to the large dynamic range supported. For instance, when adding a very large number with a very small number, the result may be surprising, where the sum ignores the small number entirely.

double[] arrDouble = new double[] { 1, 10.0e100, 1, -10e100 }; double arrSum = arrDouble.Sum();

The *.Sum()* in the above code using Linq *.Sum()* returns a zero, because the two large values overwhelm the two 1.0 values, ignoring these small values entirely.

Two algorithms have been developed to deal with this situation, and to provide a more accurate result: Kahan and its Neumaier variation. These algorithms go thru additional steps to avoid losing precision of the summation result when adding floating-point numbers. In the example above, the Neumaier sum algorithm produces the correct result of 2.

HPCsharp implements both of these more accurate algorithms for floating-point summation, for both float and double data types. Raising performance higher, Neumaier .Sum() is also implemented using SIMD/SSE instructions as well as multi-core.

Array |
Accumulator |
Scalar |
SIMD/SSE |
Scalar Multi-Core |
SIMD/SSE Multi-Core |

float[] | float | 0.6 | 2.7 | 1.9 | 4.8 |

float[] | double | 0.6 | 1.4 | 1.7 | 3.9 |

double[] | double | 0.6 | 1.3 | 1.6 | 2.5 |

The table above compares Neumaier .Sum() algorithm performance for scalar, SIMD/SSE single-core, with multi-core and SIMD/SSE/multi-core, in Giga/sec. It shows that using a double accumulator slows performance. It also shows that Neumaier more accurate .Sum() algorithm once implemented with SIMD/SSE and multi-core is only 15% slower than the standard lower accuracy algorithm – 4.8 GigaFloats/sec versus 5.1 GigaFloats/sec, and 2.5 GigaDoubles/sec versus 2.9 GigaDoubles/sec.

### Performance vs Accuracy Trade-off

HPCsharp also implements pairwise floating-point to provide higher accuracy than the naive for loop approach to summation, but not as accurate as the Kahan/Neumaier summation. For more details see the next blog (coming soon)…

### Better *.Sum()* in How Many Ways?

- speeds up by using SIMD/SSE instruction capabilities within each core of the processor
- expands support to all unsigned integer types and smaller signed integer types
- avoids throwing overflow exception by using a 64-bit accumulator for all integer data types
- avoids throwing overflow exception by using decimal when accumulating
*long[]*or*ulong[]* - adds Kahan more accurate floating-point summation algorithm, and implements it in SIMD/SSE on a single core, as well as multi-core

### Where to Find These *.Sum()* Improvements

HPCsharp is a free nuget package, implementing all of the above *.Sum()* algorithms and can be downloaded from http://www.nuget.org

See further improvements in the blog, “Better C# .Sum() in More Ways”

### What is not Supported, Yet

Linq *.Sum()* supports nullable arrays for *int*, *long*, *float*, *double* and *decimal*. HPCsharp does not support nullable arrays, yet. Let me know if you need these implemented and accelerated. In the mean time use Linq’s *.Sum()* in this case, and don’t forget to use *.Sum().AsParallel()* to use all of the processor cores, instead of just one.

Linq *.Sum()* supports adding up a field within a user defined type/class. HPCsharp does not support this capability, yet, but it shouldn’t be difficult to add.