The number of cores in modern processors keeps growing. For quite a while we were stuck at 2 or 4 cores, but lately the count has grown to 6 and 8 even for consumer processors, to 14 cores for Intel Xeon workstation and cloud CPUs, and to 32 cores for AMD desktop and cloud processors.
It’s common practice in multi-core programming to have a threshold for switching between single-core and multi-core code: the amount of work must be large enough for multiple cores to provide a performance gain. However, a single threshold can’t serve both a 2-core processor and a 32-core processor. We need an approach that handles even hundreds or thousands of cores.
One possible way is to determine the minimal amount of work, such as array size, that is big enough for two cores to provide a speedup over a single core. Think of it as the amount of work that is large enough for two workers to be faster than one. The same idea then scales to more than two workers: as the amount of work grows, more workers are brought in to help, whenever the result is faster than doing the work with fewer workers.
This methodology ramps up the number of workers/cores as the amount of work becomes sufficient to utilize them for a performance gain.
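As a rough illustration of this ramp-up idea, the worker count can grow with the amount of work. The threshold value and the simple linear ramp below are hypothetical, for illustration only, and not HPCsharp's actual tuning:

```csharp
using System;

static class RampUp
{
    // Hypothetical: minimum number of elements per worker for parallelism to pay off.
    const int MinWorkPerWorker = 64 * 1024;

    // Ramp the worker count up with the amount of work, capped at the available cores.
    public static int WorkersFor(int workSize, int maxWorkers)
    {
        int workers = workSize / MinWorkPerWorker;   // one more worker per threshold of work
        return Math.Max(1, Math.Min(workers, maxWorkers));
    }
}
```

With this sketch, small inputs stay on a single core, mid-size inputs use a few cores, and only large inputs engage every available core.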
This ramp-up methodology has been implemented and tested in the HPCsharp C# nuget package and is available now. It is used effectively in the Parallel Merge Sort functions (SortMergePar), which provide a smooth ramp from using a single core to using all of the cores as the array size grows, scaling performance smoothly and always performing better than the single-core algorithm.
Too Little Work For Too Many Workers
The easiest parallelism approach for C# developers is to add .AsParallel(), for example .AsParallel().OrderBy(x => x) to sort, or .AsParallel().ToArray() to copy, which throws all of the cores at the problem no matter the size of the operation. Here is an example of what happens to List.AsParallel().ToArray() on a 6-core CPU.
!!! Add Graph Here showing slow down in performance!!!
As seen in the benchmark graph above, performance suffers significantly when parallelism is applied to too small a problem. In this case, the problem is too small to be processed efficiently by all of the cores. Kinda like too many workers with too little work to do. The overhead of each worker/core hurts performance, and it is better to use fewer workers/cores.
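The naive pattern being benchmarked can be sketched like this. CopyAlwaysParallel is a name made up for illustration, and AsOrdered() is added so the parallel copy preserves element order:

```csharp
using System.Collections.Generic;
using System.Linq;

static class Example
{
    public static int[] CopyAlwaysParallel(List<int> list)
    {
        // Parallelism is applied regardless of list size: fine for huge lists,
        // but the per-core overhead dominates for small ones.
        return list.AsParallel().AsOrdered().ToArray();
    }
}
```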
The same issue occurs in HPCsharp for List.CopyTo(), which copies a List of Int32’s to an existing array, as shown by two benchmarks: one running on a 14-core Xeon processor, and another on a 6-core laptop processor.
The vertical axis indicates the speed-up obtained by the multi-core List.CopyToPar() of the HPCsharp nuget package, compared to a single core. Greater than 1 means faster; less than 1 means slower. This benchmark shows several interesting observations:
- The List size must be 256K elements or larger for this copy operation to benefit from parallelism, even with two cores
- Small List sizes suffer severe performance degradation: a 64-element List is about 500X slower with a multi-core copy than with a single-core copy
- Two, three, or even four cores are not sufficient to use all of the available memory bandwidth on the Xeon processor with four memory channels; 7 cores are needed to reach peak performance
- Using more than 7 cores does not increase performance and is wasteful, causing a slight performance degradation, most likely due to contention
- A 3X performance increase is possible for List.CopyToPar() by using 7 cores and List sizes of 4 million elements or larger
The benchmark on a 6-core laptop processor shows:
- A List larger than 128K elements is needed to show a performance gain for parallel copy
- Using more than 2 cores shows no gain in performance on this two-memory-channel processor
- Performance also peaks at a List size of about 4 million elements
- A 40% performance gain is possible for a parallel copy
- Small arrays suffer severe performance degradation with a parallel copy
As developers, we don’t want to think about, “Is this case too small for parallelism?” We want the algorithm to scale parallelism as needed and to do it efficiently through a simple interface.
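A minimal sketch of such an interface, assuming a hypothetical threshold of 256K elements below which a plain sequential copy is used (the actual HPCsharp thresholds and implementation differ):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class CopyExample
{
    // Hypothetical threshold; in practice this is tuned per operation and machine.
    const int ParallelThreshold = 256 * 1024;

    public static void CopyToAdaptive(List<int> src, int[] dst)
    {
        if (src.Count < ParallelThreshold)
        {
            src.CopyTo(dst);          // small input: single core avoids parallel overhead
            return;
        }
        // Large input: split the copy across cores in contiguous ranges.
        int cores = Environment.ProcessorCount;
        Parallel.For(0, cores, p =>
        {
            int start = (int)((long)src.Count * p / cores);
            int end   = (int)((long)src.Count * (p + 1) / cores);
            for (int i = start; i < end; i++)
                dst[i] = src[i];
        });
    }
}
```

The caller sees one method; the small-versus-large decision happens inside it.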
Efficiency for Large Arrays
At the other end of the spectrum, when arrays are very large, we’d like performance to be maximized. In this case, there is so much work that there is no question about needing all of the available workers; there is plenty of work to go around for everyone. In fact, we would love to have more workers, but we just don’t have any more available.
In this case, to minimize the overhead of switching between workers, we can simply split the array into M chunks for M available workers, so that every worker gets the same amount of work. This method has been benchmarked in the Parallel Merge Sort algorithm, resulting in peak performance.
When the array was broken into smaller chunks, performance went down. When fewer workers/cores were used, performance also decreased.
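The equal-split arithmetic can be sketched as follows. SplitEven is a hypothetical helper written for illustration, not HPCsharp's API; it produces M contiguous chunks whose sizes differ by at most one element:

```csharp
static class Chunking
{
    // Split n items into m contiguous chunks whose sizes differ by at most one.
    public static (int start, int length)[] SplitEven(int n, int m)
    {
        var chunks = new (int start, int length)[m];
        for (int p = 0; p < m; p++)
        {
            int start = (int)((long)n * p / m);
            int end   = (int)((long)n * (p + 1) / m);
            chunks[p] = (start, end - start);
        }
        return chunks;
    }
}
```

For example, 10 items split across 3 workers yields chunks of 3, 3, and 4 items, covering the whole array with no overlap.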
The above method is included in the Parallel Merge Sort algorithm implementation of the HPCsharp nuget package on http://www.nuget.org
The Parallel Merge Sort algorithm handles small array sizes efficiently by ramping up the number of cores/workers from one to the maximum available. It also handles large array sizes efficiently by splitting the array evenly among the workers, which minimizes overhead.
The above method of splitting the array into equal chunks, one per core, is as simple as it can be, following the KISS principle (“Keep It Simple, Stupid”), which requires any additional complexity to have a justification.
However, when the array size is large, equal chunking does not perform as well if one of the cores gets interrupted to do something else for a while. In that case, the other cores finish their work and then wait for the interrupted core to finish its chunk.
A better way is to split the work into chunks that are just large enough that the overhead of switching between workers/cores is negligible. This lets idle workers pick up the slack when one or more workers/cores have not finished their work yet.
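A minimal sketch of this idea, using the .NET Parallel.For scheduler to hand out more chunks than there are cores, so a stalled core leaves work that others can take. ChunkSize is a hypothetical value; in practice it would be tuned so per-chunk overhead stays negligible:

```csharp
using System;
using System.Threading.Tasks;

static class DynamicChunks
{
    // Hypothetical chunk size: large enough that per-chunk overhead is negligible.
    const int ChunkSize = 64 * 1024;

    public static void ProcessInChunks(int[] data, Action<int[], int, int> processRange)
    {
        int numChunks = (data.Length + ChunkSize - 1) / ChunkSize;
        // More chunks than cores: if one core stalls, the others pick up the
        // remaining chunks instead of idling until the stalled core finishes.
        Parallel.For(0, numChunks, c =>
        {
            int start  = c * ChunkSize;
            int length = Math.Min(ChunkSize, data.Length - start);
            processRange(data, start, length);
        });
    }
}
```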
This level of sophistication will be added to HPCsharp soon…