My earlier Faster Sorting in C# blog described a Parallel Merge Sort algorithm, which scaled well from 4 cores to 26 cores, running 4X to 20X faster, respectively, than the standard C# Linq.AsParallel().OrderBy. In this blog, I’ll describe an even faster Parallel Merge Sort implementation – faster by another 2X.
Performance of the New Approach
C# Array.Sort is fast and in-place, but it is limited to a single core, as it has no parallel version. Linq.AsParallel().OrderBy is slower, is not in-place, and does not scale well as the number of cores increases. The table below shows performance when sorting an array of integers (Int32’s, to be precise), using either a single core of an AMD EPYC processor, all 32 cores, or all 48 cores (with hyperthreading):
| Algorithm | Number of Cores | Millions of Int32 per second | Number of Cores | Millions of Int32 per second |
| --- | --- | --- | --- | --- |
HPCsharp Parallel Merge Sort is 40X faster on 32 cores and over 50X faster on 48 cores!
For a full set of benchmark tables, see the HPCsharp Readme (Merge Sort Section).
The above graph compares 32-core performance of the HPCsharp Parallel Merge Sort algorithm versus Linq’s parallel (AsParallel) OrderBy function, for arrays of unsigned integers. The horizontal axis is the length of the array. The vertical axis is performance in integers per second: the higher the number, the higher the performance.
HPCsharp Parallel Merge Sort is itself recursive: it breaks the problem down into two halves, which are sorted in parallel. The sorted halves are then merged together, leveraging the scalability and performance of the Parallel Merge algorithm, which is also recursive and parallelizes very well. Thus, every part of the overall algorithm is built to scale well on parallel processors.
Plus, merge is an algorithm that is very friendly to cache and system memory accesses. At the bottom of the recursion, the base case accesses memory in sequential order, which is as good as it gets for high performance.
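To illustrate why merge is so memory-friendly, here is a minimal sequential two-way merge (my own sketch for illustration, not code from HPCsharp). Both inputs are read, and the output is written, in strictly ascending address order:

```csharp
using System;

static class MergeDemo
{
    // Merge two sorted arrays into one sorted array. Every element is read
    // and written in sequential (ascending) address order, which is ideal
    // for caches and hardware prefetchers.
    public static int[] Merge(int[] left, int[] right)
    {
        var dst = new int[left.Length + right.Length];
        int i = 0, j = 0, k = 0;
        while (i < left.Length && j < right.Length)
            dst[k++] = left[i] <= right[j] ? left[i++] : right[j++];
        while (i < left.Length)  dst[k++] = left[i++];   // drain leftover left
        while (j < right.Length) dst[k++] = right[j++];  // drain leftover right
        return dst;
    }
}
```

For example, `MergeDemo.Merge(new[] {1, 3, 5}, new[] {2, 2, 4})` produces the sorted array `{1, 2, 2, 3, 4, 5}`.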
At the bottom of the Merge Sort recursion, a hybrid approach is used: Array.Sort serves as the base case. Array.Sort is based on Introspective Sort, which runs well on a single core and is also known for being cache and memory access pattern friendly.
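The overall structure can be sketched as follows. This is my own simplified sketch, not HPCsharp’s actual implementation: the `Threshold` value is an assumed cutoff (HPCsharp tunes this empirically), and the final merge here is sequential, whereas HPCsharp uses a recursive Parallel Merge for that step as well:

```csharp
using System;
using System.Threading.Tasks;

static class ParallelMergeSortSketch
{
    const int Threshold = 8192;  // assumed cutoff for the hybrid base case

    public static void Sort(int[] src) =>
        Sort(src, 0, src.Length, new int[src.Length]);  // extra buffer: not in-place

    static void Sort(int[] src, int lo, int hi, int[] work)
    {
        int len = hi - lo;
        if (len <= Threshold)
        {
            Array.Sort(src, lo, len);  // hybrid base case: Introspective Sort
            return;
        }
        int mid = lo + len / 2;
        Parallel.Invoke(                       // sort the two halves in parallel
            () => Sort(src, lo, mid, work),
            () => Sort(src, mid, hi, work));

        // Merge the sorted halves into the work buffer, then copy back.
        // (HPCsharp performs this merge in parallel, too.)
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi)
            work[k++] = src[i] <= src[j] ? src[i++] : src[j++];
        while (i < mid) work[k++] = src[i++];
        while (j < hi)  work[k++] = src[j++];
        Array.Copy(work, lo, src, lo, len);
    }
}
```

Note how the working buffer, the same size as the input, is what makes the algorithm not in-place, as discussed below.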
Scaling Small and Large
I’ve been working on improving scaling lately. The problem, as one developer at a talk I gave pointed out, is that these parallel algorithms work great for large problems, but how do you make them work for small problems as well? A simple solution is to ramp up parallelism by defining a quantum of work large enough to see a performance gain from more than one worker/core. In other words, any amount of work smaller than this quantum needs only a single worker/core, while any larger amount of work benefits from more than one worker. As the amount of work grows, so does the number of workers (cores) applied. This idea is implemented in the HPCsharp Parallel Merge Sort.
This is a simple and effective way of providing scaling when the amount of work is small. But we also need to optimize how larger amounts of work are handled, to minimize the overhead of performing work in parallel across multiple workers. When we know the number of workers, such as the number of cores in a processor, we can simply divide the total amount of work (e.g. the array) into an equal number of pieces, one for each core (worker) to work on. For example, divide the array to be sorted by the number of cores in the processor, sort each sub-array on its own core, and merge the results in parallel. This idea is also implemented in the HPCsharp Parallel Merge Sort.
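As a rough sketch of this idea (again, my own illustration rather than HPCsharp’s implementation, and with sequential pairwise merges where HPCsharp merges in parallel), one chunk per core is sorted concurrently and the sorted runs are then merged:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ChunkedSortSketch
{
    // Split the array into one chunk per core, sort the chunks in parallel,
    // then merge pairs of sorted runs until a single sorted array remains.
    public static int[] Sort(int[] src)
    {
        if (src.Length == 0) return Array.Empty<int>();

        int workers = Math.Max(1, Environment.ProcessorCount);
        int chunk = (src.Length + workers - 1) / workers;  // ceiling division

        var runs = new List<int[]>();
        for (int lo = 0; lo < src.Length; lo += chunk)
        {
            int len = Math.Min(chunk, src.Length - lo);
            var run = new int[len];
            Array.Copy(src, lo, run, 0, len);
            runs.Add(run);
        }

        Parallel.ForEach(runs, run => Array.Sort(run));  // one chunk per core

        while (runs.Count > 1)  // pairwise merge passes
        {
            var next = new List<int[]>();
            for (int i = 0; i + 1 < runs.Count; i += 2)
                next.Add(Merge(runs[i], runs[i + 1]));
            if (runs.Count % 2 == 1) next.Add(runs[runs.Count - 1]);
            runs = next;
        }
        return runs[0];
    }

    static int[] Merge(int[] a, int[] b)
    {
        var dst = new int[a.Length + b.Length];
        int i = 0, j = 0, k = 0;
        while (i < a.Length && j < b.Length)
            dst[k++] = a[i] <= b[j] ? a[i++] : b[j++];
        while (i < a.Length) dst[k++] = a[i++];
        while (j < b.Length) dst[k++] = b[j++];
        return dst;
    }
}
```

Matching the chunk count to `Environment.ProcessorCount` is the point of the paragraph above: fewer chunks leave cores idle, while more chunks add merge and scheduling overhead.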
Benchmarks verified that splitting the array into either more or fewer sub-arrays than the number of cores resulted in performance degradation. It seems that splitting the total work by the number of available workers minimized the parallel overhead and balanced the load well across the available workers, in the case of the Parallel Merge Sort.
The above idea worked very well when applied to the Parallel Merge Sort algorithm, using Array.Sort as the base case of recursion. Parallel Merge was then used to merge the sorted sub-arrays into a single sorted full array, in parallel.
Basically, HPCsharp Parallel Merge Sort is a parallel version of Array.Sort()!
This algorithm has been released as part of the HPCsharp free and open source NuGet package, providing the highest-performance generic sorting: nearly 300 million Int32’s per second, and over 40X faster than Linq.AsParallel().OrderBy.
For this enormous performance gain, something had to give. In a classic computer science trade-off, we gave up space for a gain in time. The resulting high-performance Parallel Merge Sort algorithm is not in-place, while Array.Sort is in-place. Relative to Linq.AsParallel.OrderBy, this isn’t bad, as that algorithm is also not in-place. But versus Array.Sort we ended up giving up space – i.e. more memory is needed, since the algorithm allocates a working buffer of the same size as the input array.
Not a bad trade-off: 2X the space for a much smaller amount of time to do exactly the same sorting operation!
Comparing to C++
C++ has recently, in the C++17 standard, significantly upped its multi-core game. Comparing C# performance to C++ is always fun, and parallel algorithms are no exception. I’ve updated my “Faster C++ Sorting” blog entry https://duvanenko.tech.blog/2020/02/03/faster-c-sorting/ with some recent benchmarks.