C++ has included wonderful implementations of sorting algorithms over the years. Recently, with C++17 support for parallelism, sorting performance has skyrocketed by running on all of the available cores. The number of cores is predicted to grow in double-digit percentage per year, as competition between Intel, AMD, ARM and other processor vendors heats up. The core race will give these parallel algorithms the power to give us even faster performance.
Not only is the number of cores going up, but along with it, memory bandwidth is being scaled. From two channels in laptops and desktops, to four, six and eight channels in workstations. Some dual-socket CPU designs double even that. Scaling of the number of computational cores along with scaling of bandwidth, plays right into faster parallel algorithms, one of which is parallel sorting.
I put together a free open source repo to show how standard C++17 sorting performance scales with the number of CPU cores. The same code runs on Windows and Ubuntu 20.04 Linux, which supports the latest g++ with C++17. A VisualStudio 2019 solution is included, using the Microsoft compiler and Intel’s OneAPI compiler. Instructions on how to build on Ubuntu 20.04 Linux are provided in the project Readme.
This repo also includes several sorting algorithms, which I have developed over the years, starting from my Dr. Dobb’s column on Parallel Algorithms. Algorithms such as Parallel Merge Sort, LSD Radix Sort, Parallel LSD Radix Sort, and Parallel In-Place Merge Sort are included, with more to follow, along with optimizations developed in https://www.nuget.org/packages/HPCsharp (C# nuget package, called HPCSharp).
C++17 Parallel Sort Performance
Benchmark results of the following C++17 algorithms on Ubuntu 20.04 using g++, sorting an array of 10 Million unsigned long integers (32-bits each):
- single core:
sort(sorted.begin(), sorted.end())
- multi-core:
sort(std::execution::par_unseq, sorted.begin(), sorted.end())
Algorithm | Random | Presorted | Description |
sort single core | 11 | 32 | 48-core Intel Xeon, with hyperthreading (96 vCPUs) |
sort multi-core | 93 | 250 | 48-core Intel Xeon, with hyperthreading (96 vCPUs) |
The latest performance benchmarks are also listed in (https://github.com/DragonSpit/ParallelAlgorithms). Scaling by a bit over 8X when running on a 48-core processor versus running on a single core is a bit disappointing.
A Higher Performance Parallel Algorithm
A higher performance sorting algorithm is provided in my open source repo. Parallel Merge Sort is generic, comparison based algorithm, with performance shown in the following table:
Algorithm |
Random |
Presorted |
Constant |
Computer |
Merge Sort single core |
13 |
109 |
169 |
6-core Intel i7-9750H, with hyperthreading |
Merge Sort multi-core |
108 |
261 |
285 |
6-core Intel i7-9750H, with hyperthreading |
Merge Sort single core |
12 |
100 |
148 |
48-core Intel Xeon, with hyperthreading (96 vCPUs) |
Merge Sort multi-core |
565 |
850 |
1064 |
48-core Intel Xeon, with hyperthreading (96 vCPUs) |
This free library repo is provided as headers, similar to STL – i.e. there is no library to link to – you include the source code instead. Parallel Merge Sort, provided in the repo above, scales much better, reaching substantially higher performance than the standard C++ multi-core sort implementation.
Simple interfaces are provided, which are similar to the standard C++ sort. The interface is not quite as advanced as the one provided by standard C++, supporting only generic arrays and vectors, since I lack the skill of abstracting generic iterators properly.
While the algorithm is not in-place, some of the interfaces provided are, allocating a working array temporarily if enough memory is available. If insufficient memory is available, then the standard C++ sort is used, which is in-place, thus, providing an adaptive interface, like STL does with some of its algorithms, favoring performance when space is available, providing a classic Computer Science space-time trade-off.
The above high performance Parallel Merge Sort algorithm is a hybrid algorithm of Parallel Merge Sort using Parallel Merge at the top of the recursive method, with C++ sort at the bottom of recursion.
Sadly, the ARM-based machine in AWS are running an older version of Linux, which doesn’t support C++17. It would have been fun to compare performance on these.
Examples
Examples of usage for the Parallel Merge Sort and Parallel LSD Radix Sort, as well as standard C++ sort algorithms are provided in the repo, for VisualStudio 2019, g++, and Intel compiler, for Windows 10 and Linux. You can build and run these examples to see how performance compares on your target computer.
Resources
The Parallel Merge Sort algorithm is based on the algorithm described in “Introduction to Algorithms” 3rd edition by Cormen, Leiserson, Rivest and Stein (Chapter 27.3, pp. 797-810), with early termination to Insertion Sort.
C++17 blog https://www.bfilipek.com/2018/11/parallel-alg-perf.html
Microsoft C+17 blog https://devblogs.microsoft.com/cppblog/using-c17-parallel-algorithms-for-better-performance/
HPCsharp Nuget Package for C# : https://www.nuget.org/packages/HPCsharp
2 thoughts on “Faster C++ Sorting”