Faster C++ Sorting

C++ has included wonderful implementations of sorting algorithms over the years. Recently, with C++17 support for parallelism, sorting performance has skyrocketed by running on all of the available cores. The number of cores is predicted to grow in double-digit percentage per year, as competition between Intel, AMD and ARM heats up. This core race will give these parallel algorithms the power to give us even faster performance.

Not only is the number of cores going up, but along with it, memory bandwidth is being scaled. From two channels in laptops and desktops, to four, six and eight channels in workstations. Some dual-socket CPU designs double even that. Scaling of the number of computational cores along with scaling of bandwidth, plays right into faster parallel algorithms, one of which is parallel sorting.

I put together an open source repo to show how standard C++17 sorting performance scales with the number of CPU cores. The same code runs on Windows and Ubuntu 20.04 Linux, which supports the latest g++ with C++17. A VisualStudio 2019 solution is included. Instructions on how to build on Ubuntu 20.04 Linux are provided in the project Readme.

This repo also includes several sorting algorithms, which I have developed over the years, starting from my Dr. Dobb’s column on Parallel Algorithms. To start with, Parallel Merge Sort and LSD Radix Sort are there, with many more to follow, along with optimizations developed in HPCsharp C# nuget package.

C++17 Parallel Sort Performance

Benchmark results of the following C++17 algorithms on Ubuntu 20.04 using g++, sorting an array of 10 Million unsigned integers:

  • single core: sort(sorted.begin(), sorted.end())
  • multi-core: sort(std::execution::par_unseq, sorted.begin(), sorted.end())
Algorithm Random Presorted Description
sort single core 14 72 48-core Intel Xeon, with hyperthreading (96 vCPUs)
sort multi-core 625 1,333 48-core Intel Xeon, with hyperthreading (96 vCPUs)

This is the first time I’ve ever seen C++ sorting on CPUs above 1 Giga array elements per second! Kudos to the C++17 algorithm developers for such a fantastic job! It’s also scaling very well with the number of cores and number of memory channels.

Performance of sorting an array of unsigned long integers:

Algorithm Random Presorted Description
sort single core 14 16 48-core Intel Xeon, with hyperthreading (96 vCPUs)
sort multi-core 401 513 48-core Intel Xeon, with hyperthreading (96 vCPUs)

Can’t wait until machines with dual-socket 64-core each become available.

Sadly, the ARM-based machine in AWS are running an older version of Linux, which doesn’t support C++17. It would have been fun to compare performance on these.

Performance of My Algorithms

Algorithm OS Compiler Cores Type Random Pre-Sorted Mega/sec
C++17 sort Windows Intel 1 double Yes 11
C++17 sort Windows Intel 6 double Yes 61
Parallel Merge Sort Windows Intel 6 double Yes 76
C++17 sort Windows Intel 1 unsigned long Yes 11
C++17 sort Windows Intel 6 unsigned long Yes 61
Parallel Merge Sort Windows Intel 6 unsigned long Yes 81
LSD Radix Sort Windows Intel 1 unsigned long Yes 145
C++17 sort Windows Intel 1 unsigned Yes 11
C++17 sort Windows Intel 6 unsigned Yes 65
Parallel Merge Sort Windows Intel 6 unsigned Yes 85
C++17 sort Windows Intel 1 unsigned Yes 37
C++17 sort Windows Intel 6 unsigned Yes 175
Parallel Merge Sort Windows Intel 6 unsigned Yes 213

One thought on “Faster C++ Sorting

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s