The number of cores in modern processors is growing. For quite a while we were stuck at 2 or 4 cores, but lately core counts have grown to 6 and 8 even in consumer processors, to 14 cores in Intel Xeon workstation and cloud CPUs, and to 32 cores in AMD desktop and cloud processors.
It’s a common practice in multi-core programming to have a threshold for switching between single-core code and multi-core code. This threshold is the amount of work large enough for multiple cores to provide a performance gain. However, a single threshold isn’t sufficient for both a 2-core processor and a 32-core processor. We need an approach that handles even hundreds or thousands of cores.
One possible way is to determine the minimum amount of work, such as an array size, that is big enough for two cores to provide a speedup over a single core. Think of it as the amount of work large enough for two workers to be faster than a single worker.
The same idea can then be scaled to more than two workers: as the amount of work grows, more workers are used whenever that is faster than doing the work with fewer workers. For example, if N items are processed more efficiently by two workers than by one, with each worker handling N/2 items, then 3 workers would need 1.5N items, 4 workers would need 2N items, and so on. In other words, additional workers are brought on only as the amount of work grows enough to support them.
This methodology ramps up the number of workers/cores only when there is enough work to keep all of them busy and provide a performance gain. As the amount of work grows, so does the number of workers/cores, until all the cores are used.
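The ramp-up rule above can be sketched in a few lines of C#. This is only an illustration, not the HPCsharp implementation: the class name, method name, and the 256K minimum-work constant are all hypothetical placeholders for whatever threshold a particular operation measures out to.

```csharp
using System;

class RampUp
{
    // Hypothetical minimum number of items at which two workers beat one
    // worker for some operation (the actual value must be measured per operation).
    const int MinWorkForTwoWorkers = 256 * 1024;

    // Ramp-up rule from the text: 2 workers need N items, 3 workers need 1.5N,
    // 4 workers need 2N, ... i.e. k workers need k * N / 2 items.
    public static int WorkerCount(long numItems, int maxCores)
    {
        long k = 2 * numItems / MinWorkForTwoWorkers;   // largest k with numItems >= k * N / 2
        if (k < 2) return 1;                            // too little work: stay single-core
        return (int)Math.Min(k, maxCores);              // never exceed available cores
    }
}
```

With this rule the worker count grows smoothly with the amount of work and saturates at the core count of the machine, instead of jumping straight from one core to all cores at a single threshold.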
This ramp-up methodology is currently being implemented and tested in the HPCsharp C# nuget package and should be available soon, starting with a single parallel function, to support both small and large amounts of work with smooth performance scaling.
Too Little Work For Too Many Workers
The easiest parallelism approach for C# developers is to add .AsParallel().Sort() or .AsParallel().ToArray(), which throws all of the cores at the problem no matter the size of the operation. Here is an example of what happens to List.AsParallel().ToArray() on a 6-core CPU.
!!! Add Graph Here showing slow down in performance!!!
As seen in the benchmark graph above, performance suffers significantly when parallelism is applied to too small a problem: the problem is too small to be processed efficiently by all of the cores. Kinda like too many workers with too little work to do. The overhead of each worker/core hurts performance, and it is better to use fewer workers/cores.
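The usual fix for this is the single-threshold guard described earlier: fall back to plain single-core code below some size. Here is a minimal sketch, where the extension-method name and the 256K threshold are hypothetical values chosen for illustration:

```csharp
using System.Collections.Generic;
using System.Linq;

static class CopyGuard
{
    // Hypothetical threshold: below it, a plain single-core copy wins;
    // above it, there is enough work to amortize the per-core overhead.
    const int ParallelThreshold = 256 * 1024;

    public static int[] ToArrayAdaptive(this List<int> list)
    {
        if (list.Count < ParallelThreshold)
            return list.ToArray();                       // single-core: no parallel overhead
        return list.AsParallel().AsOrdered().ToArray();  // multi-core: enough work per core
    }
}
```

This avoids the worst of the small-input slowdown, but a single threshold still jumps from one core straight to all cores, which is exactly the problem the ramp-up methodology addresses.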
The same issue occurs in HPCsharp for List.CopyTo(), which copies a List of Int32’s to an existing array, as shown by two benchmarks: one running on a 14-core Xeon processor, and another on a 6-core laptop processor.
The vertical axis indicates the speed-up of the multi-core List.CopyToPar() in the HPCsharp nuget package compared to a single core: greater than 1 means faster, less than 1 means slower. This benchmark yields several interesting observations:
- List size must be 256K or larger to benefit from parallelism in this copy operation, even with just two cores
- Small List sizes suffer severe performance degradation: a 64-element List is about 500X slower as a multi-core copy than as a single-core copy
- Two, three, or even four cores are not enough to use all of the available memory bandwidth on the Xeon processor with its four memory channels; 7 cores are needed to reach peak performance
- Using more than 7 cores does not increase performance and is wasteful, causing a slight performance degradation, most likely due to contention
- A 3X performance increase is possible for List.CopyToPar() by using 7 cores and List sizes of 4 million or more
The benchmark on a 6-core laptop processor shows:
- Lists must be larger than 128K to show a performance gain from parallel copy
- Using more than 2 cores shows no further performance gain on this two-memory-channel processor
- Performance also peaks at a List size of about 4 million
- A 40% performance gain is possible for a parallel copy
- Small arrays suffer severe performance degradation under parallel copy
As developers, we don’t want to have to ask, “Is this case too small for parallelism?” We want the algorithm to scale parallelism as needed, and to do it efficiently through a simple interface.
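Such an interface could combine the per-operation threshold with the ramp-up rule and a cap at the measured useful core count. The sketch below is hypothetical and simplified, not the actual HPCsharp CopyToPar() implementation; the method name and both tuning constants are illustrative, with values borrowed from the Xeon benchmark above:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class AdaptiveCopy
{
    // Hypothetical tuning constants (real values must be measured per machine):
    const int MinItemsForTwoCores = 256 * 1024; // smallest size where 2 cores beat 1
    const int MaxUsefulCores = 7;               // cores needed to saturate memory bandwidth

    // Copy a List of ints into an existing array, ramping core count with size.
    public static void CopyToAdaptive(this List<int> src, int[] dst)
    {
        long cores = Math.Min(2L * src.Count / MinItemsForTwoCores,
                              Math.Min(MaxUsefulCores, Environment.ProcessorCount));
        if (cores < 2) { src.CopyTo(dst); return; }  // too little work: single core

        var options = new ParallelOptions { MaxDegreeOfParallelism = (int)cores };
        int chunk = (src.Count + (int)cores - 1) / (int)cores;
        Parallel.For(0, (int)cores, options, w =>
        {
            int start = w * chunk;                       // each worker copies its own slice
            int end = Math.Min(start + chunk, src.Count);
            for (int i = start; i < end; i++)
                dst[i] = src[i];
        });
    }
}
```

The caller just calls one method; the decision of how many cores to use, including none at all, is made internally from the size of the work.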