Faster C++ Sorting

C++ has included wonderful implementations of sorting algorithms over the years. Recently, with C++17 support for parallelism, sorting performance has skyrocketed by running on all of the available cores. The number of cores is predicted to grow in double-digit percentage per year, as competition between Intel, AMD, ARM and other processor vendors heats up. The core race will give these parallel algorithms the power to give us even faster performance.

Not only is the number of cores going up, but along with it, memory bandwidth is being scaled. From two channels in laptops and desktops, to four, six and eight channels in workstations. Some dual-socket CPU designs double even that. Scaling of the number of computational cores along with scaling of bandwidth, plays right into faster parallel algorithms, one of which is parallel sorting.

I put together a free open source repo to show how standard C++17 sorting performance scales with the number of CPU cores. The same code runs on Windows and Ubuntu 20.04 Linux, which supports the latest g++ with C++17. A VisualStudio 2019 solution is included, using the Microsoft compiler and Intel’s OneAPI compiler. Instructions on how to build on Ubuntu 20.04 Linux are provided in the project Readme.

This repo also includes several sorting algorithms, which I have developed over the years, starting from my Dr. Dobb’s column on Parallel Algorithms. Algorithms such as Parallel Merge Sort, LSD Radix Sort, Parallel LSD Radix Sort, and Parallel In-Place Merge Sort are included, with more to follow, along with optimizations developed in https://www.nuget.org/packages/HPCsharp (C# nuget package, called HPCSharp).

C++17 Parallel Sort Performance

Benchmark results of the following C++17 algorithms on Ubuntu 20.04 using g++, sorting an array of 10 Million unsigned long integers (32-bits each):

single core: sort(sorted.begin(), sorted.end())
multi-core: sort(std::execution::par_unseq, sorted.begin(), sorted.end())

Algorithm	Random	Presorted	Description
sort single core	11	32	48-core Intel Xeon, with hyperthreading (96 vCPUs)
sort multi-core	93	250	48-core Intel Xeon, with hyperthreading (96 vCPUs)

The latest performance benchmarks are also listed in (https://github.com/DragonSpit/ParallelAlgorithms). Scaling by a bit over 8X when running on a 48-core processor versus running on a single core is a bit disappointing.

A Higher Performance Parallel Algorithm

A higher performance sorting algorithm is provided in my open source repo. Parallel Merge Sort is generic, comparison based algorithm, with performance shown in the following table:

Algorithm	Random	Presorted	Constant	Computer
Merge Sort single core	13	109	169	6-core Intel i7-9750H, with hyperthreading
Merge Sort multi-core	108	261	285	6-core Intel i7-9750H, with hyperthreading
Merge Sort single core	12	100	148	48-core Intel Xeon, with hyperthreading (96 vCPUs)
Merge Sort multi-core	565	850	1064	48-core Intel Xeon, with hyperthreading (96 vCPUs)

This free library repo is provided as headers, similar to STL – i.e. there is no library to link to – you include the source code instead. Parallel Merge Sort, provided in the repo above, scales much better, reaching substantially higher performance than the standard C++ multi-core sort implementation.

Simple interfaces are provided, which are similar to the standard C++ sort. The interface is not quite as advanced as the one provided by standard C++, supporting only generic arrays and vectors, since I lack the skill of abstracting generic iterators properly.

While the algorithm is not in-place, some of the interfaces provided are, allocating a working array temporarily if enough memory is available. If insufficient memory is available, then the standard C++ sort is used, which is in-place, thus, providing an adaptive interface, like STL does with some of its algorithms, favoring performance when space is available, providing a classic Computer Science space-time trade-off.

The above high performance Parallel Merge Sort algorithm is a hybrid algorithm of Parallel Merge Sort using Parallel Merge at the top of the recursive method, with C++ sort at the bottom of recursion.

Sadly, the ARM-based machine in AWS are running an older version of Linux, which doesn’t support C++17. It would have been fun to compare performance on these.

Examples

Examples of usage for the Parallel Merge Sort and Parallel LSD Radix Sort, as well as standard C++ sort algorithms are provided in the repo, for VisualStudio 2019, g++, and Intel compiler, for Windows 10 and Linux. You can build and run these examples to see how performance compares on your target computer.

Resources

The Parallel Merge Sort algorithm is based on the algorithm described in “Introduction to Algorithms” 3rd edition by Cormen, Leiserson, Rivest and Stein (Chapter 27.3, pp. 797-810), with early termination to Insertion Sort.

C++17 blog https://www.bfilipek.com/2018/11/parallel-alg-perf.html

Microsoft C+17 blog https://devblogs.microsoft.com/cppblog/using-c17-parallel-algorithms-for-better-performance/

HPCsharp Nuget Package for C# : https://www.nuget.org/packages/HPCsharp

Algorithm Performance

Measure, Question, Improve, Do It Again…

Faster C++ Sorting

C++17 Parallel Sort Performance

A Higher Performance Parallel Algorithm

Examples

Resources

3 thoughts on “Faster C++ Sorting”

Leave a comment Cancel reply

C++17 Parallel Sort Performance

A Higher Performance Parallel Algorithm

Examples

Resources

Share this:

Related

3 thoughts on “Faster C++ Sorting”

Leave a comment Cancel reply