Faster C++ Sorting

C++ has included wonderful implementations of sorting algorithms over the years. Recently, with C++17 support for parallelism, sorting performance has skyrocketed by running on all of the available cores. The number of cores is predicted to grow in double-digit percentage per year, as competition between Intel, AMD, ARM and other processor vendors heats up. The core race will give these parallel algorithms the power to give us even faster performance.

Not only is the number of cores going up, but along with it, memory bandwidth is being scaled. From two channels in laptops and desktops, to four, six and eight channels in workstations. Some dual-socket CPU designs double even that. Scaling of the number of computational cores along with scaling of bandwidth, plays right into faster parallel algorithms, one of which is parallel sorting.

I put together a free open source repo to show how standard C++17 sorting performance scales with the number of CPU cores. The same code runs on Windows and Ubuntu 20.04 Linux, which supports the latest g++ with C++17. A VisualStudio 2019 solution is included, using the Microsoft compiler and Intel’s OneAPI compiler. Instructions on how to build on Ubuntu 20.04 Linux are provided in the project Readme.

This repo also includes several sorting algorithms, which I have developed over the years, starting from my Dr. Dobb’s column on Parallel Algorithms. Algorithms such as Parallel Merge Sort, LSD Radix Sort, Parallel LSD Radix Sort, and Parallel In-Place Merge Sort are included, with more to follow, along with optimizations developed in https://www.nuget.org/packages/HPCsharp (C# nuget package, called HPCSharp).

C++17 Parallel Sort Performance

Benchmark results of the following C++17 algorithms on Ubuntu 20.04 using g++, sorting an array of 10 Million unsigned long integers (32-bits each):

  • single core: sort(sorted.begin(), sorted.end())
  • multi-core: sort(std::execution::par_unseq, sorted.begin(), sorted.end())
Algorithm Random Presorted Description
sort single core 11 32 48-core Intel Xeon, with hyperthreading (96 vCPUs)
sort multi-core 93 250 48-core Intel Xeon, with hyperthreading (96 vCPUs)

The latest performance benchmarks are also listed in (https://github.com/DragonSpit/ParallelAlgorithms). Scaling by a bit over 8X when running on a 48-core processor versus running on a single core is a bit disappointing.

A Higher Performance Parallel Algorithm

A higher performance sorting algorithm is provided in my open source repo. Parallel Merge Sort is generic, comparison based algorithm, with performance shown in the following table:

Algorithm

Random

Presorted

Constant

Computer

Merge Sort single core

13

109

169

6-core Intel i7-9750H, with hyperthreading

Merge Sort multi-core

108

261

285

6-core Intel i7-9750H, with hyperthreading

Merge Sort single core

12

100

148

48-core Intel Xeon, with hyperthreading (96 vCPUs)

Merge Sort multi-core

565

850

1064

48-core Intel Xeon, with hyperthreading (96 vCPUs)

This free library repo is provided as headers, similar to STL – i.e. there is no library to link to – you include the source code instead. Parallel Merge Sort, provided in the repo above, scales much better, reaching substantially higher performance than the standard C++ multi-core sort implementation.

Simple interfaces are provided, which are similar to the standard C++ sort. The interface is not quite as advanced as the one provided by standard C++, supporting only generic arrays and vectors, since I lack the skill of abstracting generic iterators properly.

While the algorithm is not in-place, some of the interfaces provided are, allocating a working array temporarily if enough memory is available. If insufficient memory is available, then the standard C++ sort is used, which is in-place, thus, providing an adaptive interface, like STL does with some of its algorithms, favoring performance when space is available, providing a classic Computer Science space-time trade-off.

The above high performance Parallel Merge Sort algorithm is a hybrid algorithm of Parallel Merge Sort using Parallel Merge at the top of the recursive method, with C++ sort at the bottom of recursion.

Sadly, the ARM-based machine in AWS are running an older version of Linux, which doesn’t support C++17. It would have been fun to compare performance on these.

Examples

Examples of usage for the Parallel Merge Sort and Parallel LSD Radix Sort, as well as standard C++ sort algorithms are provided in the repo, for VisualStudio 2019, g++, and Intel compiler, for Windows 10 and Linux. You can build and run these examples to see how performance compares on your target computer.

Resources

The Parallel Merge Sort algorithm is based on the algorithm described in “Introduction to Algorithms” 3rd edition by Cormen, Leiserson, Rivest and Stein (Chapter 27.3, pp. 797-810), with early termination to Insertion Sort.

C++17 blog https://www.bfilipek.com/2018/11/parallel-alg-perf.html

Microsoft C+17 blog https://devblogs.microsoft.com/cppblog/using-c17-parallel-algorithms-for-better-performance/

HPCsharp Nuget Package for C# : https://www.nuget.org/packages/HPCsharp

Advertisement

2 thoughts on “Faster C++ Sorting

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s