Ramping up Parallelism

The number of cores in modern processors is growing. For quite a while we were stuck at 2 or 4 cores, but lately core counts have grown to 6 and 8 even for consumer processors, to 14 cores for Intel Xeon workstation and cloud CPUs, and to 32 cores for AMD desktop and cloud processors.

It’s common practice in multi-core programming to have a threshold for switching between single-core and multi-core code: the amount of work that is large enough for multiple cores to provide a performance gain. However, a single threshold isn’t sufficient for both a 2-core processor and a 32-core processor. We need an approach that handles even hundreds or thousands of cores.

One possible way is to determine the minimal amount of work, such as an array size, that is big enough for two cores to provide a speedup over a single core. Think of it as the amount of work that is large enough for two workers to be faster than a single worker.

The same idea can then be scaled to more than two workers: as the amount of work grows, more workers are used, whenever the result is faster than doing the work with fewer workers. For example, if N items are processed more efficiently by two workers than by a single worker, with each worker handling N/2 items, then 3 workers would need 1.5N items, 4 workers would need 2N items, and so on. In other words, more workers are brought on as the amount of work grows, since there is enough work to support them.

This methodology ramps up the number of workers/cores only when there is enough work for all of them to provide a performance gain. As the amount of work grows, so does the number of workers/cores, until all of the cores are in use.
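
To make the rule concrete, here is a minimal sketch of how such a ramp-up might be computed, assuming the linear scaling from the example above. The method name and the two-worker threshold parameter are illustrative only, not HPCsharp’s actual implementation, and the threshold itself would have to be measured per algorithm and per machine.

```csharp
using System;

public static class RampUp
{
    // workSize: amount of work, e.g. array or List length.
    // minWorkForTwoWorkers: smallest work size at which two workers beat one
    // (an assumed, measured-per-machine value).
    public static int ComputeWorkerCount(long workSize, long minWorkForTwoWorkers)
    {
        if (workSize < minWorkForTwoWorkers)
            return 1;                                   // too little work: stay single-core

        // Linear ramp from the example above:
        // N items -> 2 workers, 1.5N -> 3 workers, 2N -> 4 workers, ...
        long workers = (2 * workSize) / minWorkForTwoWorkers;

        // Never use more workers than there are cores.
        return (int)Math.Min(workers, Environment.ProcessorCount);
    }
}
```

With this rule, a 6-core machine ramps from 1 worker below the two-worker threshold up to all 6 cores once the work reaches 3 times that threshold, and stays there for anything larger.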

This ramp-up methodology is currently being implemented and tested in the HPCsharp C# NuGet package, and should be available soon in one parallel function to start with, providing performance that scales smoothly from small to large amounts of work.

Too Little Work For Too Many Workers

The easiest parallelism approach for C# developers is to add .AsParallel().OrderBy(x => x) or .AsParallel().ToArray(), which throws all of the cores at the problem, no matter the size of the operation. Here is an example of what happens to List.AsParallel().ToArray() on a 6-core CPU.
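
For reference, the naive pattern looks something like the following; the List sizes are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

var smallList = new List<int>(Enumerable.Range(0, 64));        // far too small to benefit
var largeList = new List<int>(Enumerable.Range(0, 8_000_000)); // large enough to benefit

// PLINQ spreads the work across all available cores regardless of input size,
// so the first call pays the full parallel overhead for only 64 elements.
// (Without AsOrdered(), PLINQ does not guarantee the output element order.)
int[] smallCopy = smallList.AsParallel().ToArray();
int[] largeCopy = largeList.AsParallel().ToArray();
```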

[Benchmark graph: performance slowdown of List.AsParallel().ToArray() for small List sizes on a 6-core CPU]

As seen in the benchmark graph above, performance suffers significantly when parallelism is applied to a problem that is too small. In this case the problem is too small to be processed efficiently by all of the cores: too many workers with too little work to do. The overhead of each additional worker/core hurts performance, and it is better to use fewer workers/cores.

The same issue occurs in HPCsharp for List.CopyTo(), which copies a List of Int32’s to an existing array, as shown by two benchmarks: one running on a 14-core Xeon processor and another on a 6-core laptop processor.

[Benchmark graph: 14CoreParallelListToArrayCopy (parallel List-to-array copy speedup vs. List size on the 14-core Xeon)]

The vertical axis indicates the speed-up obtained when using the multi-core List.CopyToPar() of the HPCsharp NuGet package, compared to a single core: greater than 1 means faster, and less than 1 means slower. This benchmark shows several interesting observations:

  • List size must be 256K or larger for this copy operation to benefit from parallelism, even with just two cores
  • Small List sizes suffer severe performance degradation: a 64-element List is about 500X slower with a multi-core copy than with a single-core copy
  • Two, three, or even four cores are not sufficient to use all of the available memory bandwidth on this Xeon processor with 4 channels of memory; 7 cores are needed to reach peak performance
  • Using more than 7 cores does not increase performance and is wasteful, causing a slight performance degradation, most likely due to contention
  • A 3X performance increase is possible for List.CopyToPar() by using 7 cores, for List sizes of 4 million or larger

[Benchmark graph: 6CoreParallelListToArrayCopy (parallel List-to-array copy speedup vs. List size on the 6-core laptop processor)]

The benchmark on the 6-core laptop processor shows:

  • Lists larger than 128K are needed to show a performance gain for the parallel copy
  • Using more than 2 cores shows no additional gain on this two-memory-channel processor
  • Performance also peaks at about a 4 million List size
  • A 40% performance gain is possible for the parallel copy
  • Small Lists cause severe performance degradation for the parallel copy
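
Until a ramp-up implementation is available, the workaround is for the developer to guard against both failure modes by hand: skip parallelism below a size threshold, and cap the number of cores. Here is a rough PLINQ-based sketch, not HPCsharp’s CopyToPar() implementation; the 256K threshold and the 7-core cap are taken from the 14-core benchmark above and are machine-specific assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GuardedCopy
{
    // Values observed in the benchmarks above; they differ per machine.
    private const int MinParallelSize = 256 * 1024; // smallest List size where parallel copy won
    private const int MaxUsefulCores  = 7;          // memory bandwidth saturated at ~7 cores on the Xeon

    public static int[] ToArrayGuarded(List<int> list)
    {
        if (list.Count < MinParallelSize)
            return list.ToArray();                  // too small: single-core copy is faster

        return list.AsParallel()
                   .AsOrdered()                     // keep the element order
                   .WithDegreeOfParallelism(Math.Min(MaxUsefulCores, Environment.ProcessorCount))
                   .ToArray();
    }
}
```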

As developers, we don’t want to think about, “Is this case too small for parallelism?” We want the algorithm to scale its parallelism as needed, and to do it efficiently through a simple interface.


2 thoughts on “Ramping up Parallelism”

  1. The analysis is too simplistic. In general, the relationship of throughput to cores is linear only at a low number of cores. As the number of cores (sharing the same computer memory) increases, they begin to interfere with each other (due to memory data accesses). This means that as N increases, the incremental benefit of adding more cores decreases, and at some point adding more cores actually hurts.

    My experience shows that the performance gain of parallelism is an upside-down parabola: at a low number of cores it is linear going up, it begins to flatten at some level of parallelism, and then it actually declines.


  2. Good feedback, David. This blog post is just the beginning, describing a method that can take on more sophistication. It can definitely use your insight into a variety of algorithms, and it needs to be handled on a case-by-case basis, generalizing to a curve that flattens out and most likely drops once contention reduces the slope and then the performance.

    For a few of the algorithms I’m working with, the linear model will work adequately, since the main optimization they need is around small array sizes, where always using all of the cores leads to poor performance. At larger array sizes, using too many cores causes less of a performance loss, but is still wasteful, possibly from a power point of view. This is exactly the kind of discussion and design I’ve been hoping to get to, with some real measurements to back it up.

    We’ll both have to think about how to make these performance-behavior curves for various parallel algorithms available, so that the algorithms can use them to determine the optimal number of cores for the size of the problem.

    One result I’m seeing with memory-bound algorithms is that it just about never pays to have the degree of parallelism be higher than the number of physical cores. Another important factor is the number of memory channels, obviously for memory-bound algorithms, as you’ll see from some of the benchmarks I’ll publish shortly.

