Parallel Divide-and-Conquer in C#

C# has several fundamental abstractions that make parallel programming possible. By splitting work across multiple processor cores, gains in performance can be quite significant. And, with AMD and Intel shipping processors with more and more cores, using all of these compute resources is critical.

Some of C# parallel facilities are:

  • Parallel.Invoke() which runs several functions in parallel on multiple cores
  • Parallel.ForEach() which processes a container in parallel on multiple cores

One key missing abstraction is the parallel Divide-and-Conquer. I’ve added it to the HPCsharp library (free nuget package on nuget.org), to make this abstraction available to C# developers. In this blog, I’ll describe the capabilities and simplicity of this Divide-and-Conquer abstraction. Examples will show the ease of use and demonstrate several variations available.

Divide-and-Conquer

Divide-and-Conquer is a common pattern in algorithms, where the problem is broken into smaller pieces, with each piece producing a result. These results are then combined, to produce the final answer. Merge Sort is one of the most common examples for the use of Divide-and-conquer pattern.

Divide-and-Conquer is also one of the key patterns for parallel programming, where as the problem is broken into smaller pieces, each piece runs on a separate computing core. Each core produces a result for its own smaller piece of the overall problem. Then these results are combined to produce the final answer.

Example

As an example, let’s say we have a million double-precision floating-point numbers (doubles in C#) to add up, to produce a single sum result. This can be easily done with a single for loop:

public static double SumHpc(this double[] arrayToSum, int startIndex, int length)
{
    int endIndex = startIndex + length;
    double overallSum = 0;
    for (int i = startIndex; i < endIndex; i++)
       overallSum += arrayToSum[i];
    return overallSum;
}

The above method runs at 0.96 GigaAdd/sec – almost a billion additions per second – but uses only a single core out of six available on my laptop.

Divide-and-conquer algorithm, would break the million element array into two halves, with each have producing its own sum. Then the two sums, would be combined into a single sum, to produce the final answer. Each half of the array can be processed in parallel, independently summing up all the elements within their own half of the array. Each half could run on a separate computational core, using two cores. But, we have six cores available.

Divide-and-Conquer algorithm doesn’t have to stop with splitting the array into two halves. It can continue to break each half into two halves, and then those halves into two halves and so on, by using recursion. At some point recursion needs to stop and terminate. A good point for termination is when the pieces get small enough where the amount of work is small enough to not be worthwhile splitting up any more. In this case, this small amount of work needs to be done on a single core, using the algorithm shown above.

By continuously (recursively) splitting the array into smaller and smaller pieces, there will be enough pieces to keep all compute cores busy computing sums. This recursion creates a tree, where at the leaf nodes the work of summing is done on each core. Once summing is done, then each sum result is combined with others, which is called a reduce operation. These results are then combined up the tree, continuously reducing the number of results, until there is a single sum result.

The above Divide-and-Conquer method has been implemented as a generic function in HPCsharp, providing the capability for parallel and serial usage. These generic functions are in HPCsharp.AlgorithmPatterns namespace.

Example Usage

To implement the above sum, as a parallel algorithm running on multiple cores, HPCsharp parallel Divide-and-Conquer generic function can be used, as shown:

double sum = DivideAndConquerPar(arrayToSum, startIndex, length, 
                SumHpc, (x, y) => x + y, 16384);

That’s it! That’s all it takes.

The above function runs the summation algorithm on all the cores within a multi-core processor, accelerating performance. For the same array of doubles, this version runs at 3.7 GigaAdds/sec on a six-core processor.

This DivideAndConquerPar function takes a few arguments:

  • arrayToSum is the array of doubles to be summed up
  • ¬†startIndex is the index of the first element within arrayToSum to be summed up
  • length is the number of elements to be summed up
  • SumHpc is the base-case serial single-core function – the first function shown above in this blog
  • (x, y) => x + y is the lambda reduce function, which combines the sum results from two pieces of the summation
  • 16384 is the parallel threshold, which is the number of elements below which the serial algorithm (such as SumHpc above) will be used, running on a single core

Another way to think about the last argument – it’s the parallel work quanta – the smallest amount of work worthwhile for each of the cores to do on their own. If this work quanta is too small, then the time it takes to split the array and to combine the result will produce too much overhead and will reduce the overall performance. However, if the work quanta is too large, the amount of parallelism will be reduced. One reasonable value to keep in mind, recommended by Intel, is the work quanta should be no less than 16K CPU clock cycles.

C# Parallel Alternatives

C# has a nice method for this for arrays, called .Sum(). This method also has a built-in parallel version, which can be called by using yourArray.AsParallel().Sum() function. This method is quite a bit slower than what I’ve shown above, since HPCsharp provides SIMD/SSE accelerated .Sum(), to make better use of each computational core, as well as running on all the cores.

Other Algorithms and Variations

HPCsharp Divide-and-Conquer functions are generic and can be used for many data types and many other algorithms beside .Sum(). For example, they can be used to compute minimum or maximum value of an array, or …

Divide-and-Conquer functions come in several variations:

  • Serial and Parallel version: DivideAndConquer and DivideAndConquerPar
  • Single type and TwoTypes:
    • the same data type is used for arrays, baseCase and reduce functions
    • one data type is used for the array, and a second data type is used for the return value from the baseCase and all types within the reduce function

The TwoTypes function variation is very useful for cases where larger precision is needed when combining results from each of the computational pieces.

All of these variations are used inside HPCsharp library itself.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s