In the initial blog I developed a faster *checked* summation of *ulong* arrays. By using integer operations when no arithmetic overflow is detected, helped propel performance 15X higher to 682 MegaUlongAdditions/second, for a typical case. However, the worst case was still as slow as summation of *Decimal* arrays – 48 MegaUlongAdditions/second. In this blog I’ll develop a method to raise this worst case to the same level as the typical case and then even higher.

### The Main Idea

*Decimal* data type in C# is a wonderful thing. Precision is extended to 96-bits, versus 64-bit ulong. *Decimal* handles positive and negative integers, as well as fractions. However, *Decimal* is not supported natively by the processor, whereas integers and floating-point numbers are. *Decimal* addition is about 20X slower. *BigInteger* addition is even worse, at about 100X slower.

The main idea behind improving performance of *ulong* array summation is to stop using Decimal addition in the inner loop. Instead, I’ll extend integer precision beyond 64-bit *ulong* to 96-bits. By using an extra 32-bit unsigned integer to accumulate each overflow that is detected, all computations stay as integer computations – much faster since integers are natively supported by the processor.

Below is the extended integer only summation:

public static decimal SumToDecimalEvenFaster(this ulong[] arrayToSum) { decimal overallSum = 0; ulong ulongSum = 0; uint uintUpperSum = 0; // together uintUpperSum and ulongSum represent a 96-bit uint for (int i = 0; i < arrayToSum.Length; i++) { ulong newUlongSum = ulongSum + arrayToSum[i]; if (newUlongSum < ulongSum) uintUpperSum++; // carry-out of lower 64-bit into carry-in of upper 32-bits ulongSum = newUlongSum; } decimal multiplier = 0x8000_0000_0000_0000; overallSum = multiplier * (Decimal)2 * (Decimal)uintUpperSum; // uintUpperSum << 64 overallSum += ulongSum; return overallSum; }

This algorithm is implemented more like it came from a chip designer, than a software developer. In the inner loop only integer operations are performed. Arithmetic overflow is detected and dealt with inside the inner loop. If arithmetic overflow is detected than the uintUpperSum is incremented.

Another way of looking at it is, when arithmetic overflow is detected a carry-out is generated from the 64-bit addition, which serves as carry-in to the 32-bit integer extension. Now, I’m talking like a hardware designer. Since C# arrays can only hold 2 GigaElements, 32-bit unsigned integer extension is more than sufficient to handle any size array. If C# ever extends its capabilities to larger arrays, then this 32-bit extension can easily be extended to a 64-bit ulong.

### Performance

In the worst case of every array element being *UInt32.MaxValue*, the above implementation on a single core, runs at 1.2 GigaUlongAdditions/second instead of 48 MegaUlongAdditions/second of using a Decimal code below:

Decimal sum = ulongArray.Sum(x => (decimal)x);

This is a 25X speed improvement!

Notice that performance of 1.2 Giga/sec, in the worst case, is nearly 2X higher than the typical/best case of the previous algorithm’s 682 Mega. Every case benefited from this new integer-only algorithm.

This algorithm can also be translated to use SIMD/SSE instructions, which raises performance to 1.7 GigaUlongAdditions/second on a single core.

This is a 35X speed improvement!

On a quad-core laptop CPU with dual-memory channel system memory, performance goes up to 2.7 GigaUlongAdditions/second

This is a 26X speed improvement!

Decimal sum = ulongArray.AsParallel().Sum(x => (decimal)x);

When compared to the code above. Note *.AsParallel()* to run on all four cores – running at 102 MegaUlongAdditions/second.

It’s nice to run in Giga instead of Mega!

HPCsharp nuget package includes this new higher performance algorithm, including SIMD/SSE multi-core version.