C++ has included wonderful implementations of sorting algorithms over the years. Recently, with C++17 support for parallelism, sorting performance has skyrocketed by running on all of the available cores. The number of cores is predicted to grow in double-digit percentage per year, as competition between Intel, AMD and ARM heats up. This core race will give these parallel algorithms the power to give us even faster performance.
Not only is the number of cores going up, but along with it, memory bandwidth is being scaled. From two channels in laptops and desktops, to four, six and eight channels in workstations. Some dual-socket CPU designs double even that. Scaling of the number of computational cores along with scaling of bandwidth, plays right into faster parallel algorithms, one of which is parallel sorting.
I put together an open source repo to show how standard C++17 sorting performance scales with the number of CPU cores. The same code runs on Windows and Ubuntu 20.04 Linux, which supports the latest g++ with C++17. A VisualStudio 2019 solution is included. Instructions on how to build on Ubuntu 20.04 Linux are provided in the project Readme.
This repo also includes several sorting algorithms, which I have developed over the years, starting from my Dr. Dobb’s column on Parallel Algorithms. To start with, Parallel Merge Sort and LSD Radix Sort are there, with many more to follow, along with optimizations developed in HPCsharp C# nuget package.
C++17 Parallel Sort Performance
Benchmark results of the following C++17 algorithms on Ubuntu 20.04 using g++, sorting an array of 10 Million unsigned integers:
- single core:
sort(std::execution::par_unseq, sorted.begin(), sorted.end())
|sort single core||14||72||48-core Intel Xeon, with hyperthreading (96 vCPUs)|
|sort multi-core||625||1,333||48-core Intel Xeon, with hyperthreading (96 vCPUs)|
This is the first time I’ve ever seen C++ sorting on CPUs above 1 Giga array elements per second! Kudos to the C++17 algorithm developers for such a fantastic job! It’s also scaling very well with the number of cores and number of memory channels.
Performance of sorting an array of unsigned long integers:
|sort single core||14||16||48-core Intel Xeon, with hyperthreading (96 vCPUs)|
|sort multi-core||401||513||48-core Intel Xeon, with hyperthreading (96 vCPUs)|
Can’t wait until machines with dual-socket 64-core each become available.
Sadly, the ARM-based machine in AWS are running an older version of Linux, which doesn’t support C++17. It would have been fun to compare performance on these.
Performance of My Algorithms
|Parallel Merge Sort||Windows||Intel||6||double||Yes||76|
|C++17 sort||Windows||Intel||1||unsigned long||Yes||11|
|C++17 sort||Windows||Intel||6||unsigned long||Yes||61|
|Parallel Merge Sort||Windows||Intel||6||unsigned long||Yes||81|
|LSD Radix Sort||Windows||Intel||1||unsigned long||Yes||145|
|Parallel Merge Sort||Windows||Intel||6||unsigned||Yes||85|
|Parallel Merge Sort||Windows||Intel||6||unsigned||Yes||213|