Sorting 19X Faster Than C++ Parallel Sort

(Rewritten on October 28, 2023) In my previous blog post (https://duvanenko.tech.blog/2020/02/03/faster-c-sorting/), the Standard C++ Sort algorithm was benchmarked running on a single core at 11 million 32-bit integers per second. Its parallel version scaled up to 93 million integers per second on a 48-core Xeon processor AWS node (C5.24xlarge), an 8X speedup.

Also, my implementation of Parallel Merge Sort on the same machine reached over 600 million integers per second, over 6X higher performance. This algorithm scales much better with the number of processor cores than the Standard C++ Parallel Sort does. Two other algorithms, Parallel Radix Sort and Merge Radix Sort, performed even better, reaching over 600 million integers per second. All implementations use Intel’s Threading Building Blocks, which significantly simplifies parallel implementation.

In this blog post I’ll discuss two of these parallel sort algorithms, both of which surpass 1 billion 32-bit integers per second on an Intel Xeon AWS node that is two generations newer (C7i.24xlarge with 48 cores). This level of performance is more than 10X faster than Standard C++ Parallel Sort running on the same machine with the same input data set.

Benchmark Results

An array of 100 million random 32-bit integers is the input to every algorithm benchmarked, and the same input data is used for all of them. The units in the table below are millions of integers per second.

Algorithm	Performance	Parallel	In-Place	Stable
sort	8	no	yes	no
stable_sort	9	no	maybe	yes
sort	82	yes	yes	no
stable_sort	77	yes	maybe	yes
lsd radix sort	77	no	no	yes
merge radix sort	1136	yes	no	yes
lsd radix sort	1577	yes	no	yes

Standard C++ Parallel Sort accelerates by 10X on this 48-core Xeon processor relative to its single-core version. The stable version scales by 9X. Parallel Merge Radix Sort outperforms Standard C++ Parallel Sort by 14X, while Parallel LSD Radix Sort outperforms it by 19X.

Least Significant Digit (LSD) Radix Sort running on a single core is nearly as fast as Standard C++ Parallel Sort running on 48 cores (77 versus 82 million integers per second).

Parallel LSD Radix Sort

The Parallel Least Significant Digit (LSD) Radix Sort algorithm is described in the previous blog post and in my book (chapter 7). Both the counting phase and the permutation phase are parallelized to make this a fully parallel version. An additional parallel performance optimization is also introduced: de-randomization of writes to bins during the permutation phase.

This optimization serializes memory writes during the permutation phase to turn nearly random memory write accesses into sequential ones.

Parallel Merge Radix Sort

Another approach is to use Parallel Merge Sort to construct a divide-and-conquer recursive tree, with LSD Radix Sort at the leaf nodes. This LSD Radix Sort contains three novel optimizations to improve parallel performance:

dual-phase implementation, to reduce passes over the input array nearly in half,
de-randomization of writes to bins during the permutation phase,
shallow recursive tree of constant depth.

These optimizations are described in the previous blog post and in my book. The first optimization has been integrated into NVIDIA’s top-performing Onesweep LSD Radix Sort for GPUs and appears in the latest edition of the Introduction to Algorithms book (p. 215, problem 8.3-4).

The third optimization limits the Merge Sort recursion tree to a constant depth by giving each leaf node N/M array elements to process, where N is the number of elements in the array and M is the number of processor cores. This structure leads to linear-time O(N) performance.
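The leaf-and-merge structure can be sketched serially as follows. This is an illustration under stated assumptions, not the repository code: std::stable_sort stands in for the LSD Radix Sort at the leaves, std::inplace_merge stands in for the not-in-place parallel merge, and the function name and num_leaves parameter are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort a by splitting it into num_leaves chunks of about N/M elements,
// sorting each chunk, then merging neighbouring runs level by level.
// num_leaves plays the role of M (the core count) in the text.
void leaf_then_merge_sort(std::vector<std::uint32_t>& a, std::size_t num_leaves) {
    const std::size_t n = a.size();
    if (n < 2) return;
    if (num_leaves == 0) num_leaves = 1;
    const std::size_t leaf = (n + num_leaves - 1) / num_leaves;  // ceil(N/M)
    std::uint32_t* p = a.data();

    // Leaf level: chunks are independent, so each could run as its own task.
    // std::stable_sort is a placeholder for the LSD Radix Sort leaf.
    for (std::size_t lo = 0; lo < n; lo += leaf)
        std::stable_sort(p + lo, p + std::min(lo + leaf, n));

    // Merge levels: the run width doubles each level, so there are about
    // log2(M) levels no matter how large N is.
    for (std::size_t width = leaf; width < n; width *= 2) {
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
            std::inplace_merge(p + lo, p + lo + width,
                               p + std::min(lo + 2 * width, n));
        }
    }
}
```

With M = 48 leaves there are 6 merge levels whether the array holds a thousand elements or a billion, which is what keeps the tree depth constant with respect to N.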

Price of Performance

Increased performance comes at the price of these algorithms not being in-place: they double the memory footprint. Both algorithms are stable, however, and both have linear-time O(N) performance.

Radix Sort is not a comparison-based algorithm, so supporting the various built-in data types as keys requires additional effort. Such implementations have been done successfully in C# in the open source HPCsharp NuGet package. The same methodology can be incorporated into C++ implementations.

User data types with embedded keys can also be supported, with similar complexity to implementing a custom comparison function. An example has been implemented in C#, which can be easily translated to C++.
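In C++ this could look like the sketch below, where a key-extraction function takes the place a comparison function has in std::sort. It is a serial illustration, not a translation of the C# example. The name lsd_radix_sort_by_key is hypothetical, and the record type is assumed to be default-constructible and movable.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stable LSD radix sort of arbitrary records. get_key returns the record's
// 32-bit unsigned sort key, much as a comparator serves std::sort.
template <typename T, typename KeyFn>
void lsd_radix_sort_by_key(std::vector<T>& a, KeyFn get_key) {
    std::vector<T> scratch(a.size());  // requires T to be default-constructible
    std::vector<T>* src = &a;
    std::vector<T>* dst = &scratch;
    for (unsigned shift = 0; shift < 32; shift += 8) {
        // Count how many records fall into each of the 256 bins.
        std::array<std::size_t, 256> count{};
        for (const T& r : *src) {
            const std::uint32_t k = get_key(r);
            ++count[(k >> shift) & 0xFF];
        }
        // Convert counts into starting positions (exclusive prefix sum).
        std::size_t sum = 0;
        for (auto& c : count) { const std::size_t n = c; c = sum; sum += n; }
        // Move each record to its bin, preserving arrival order.
        for (T& r : *src) {
            const std::uint32_t k = get_key(r);
            (*dst)[count[(k >> shift) & 0xFF]++] = std::move(r);
        }
        std::swap(src, dst);
    }
    // Four passes leave the sorted records back in a.
}
```

A call such as lsd_radix_sort_by_key(records, [](const Rec& r) { return r.key; }) sorts a vector of a hypothetical Rec struct by its key field, keeping records with equal keys in their original order.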

Availability

The initial implementation of this high performance algorithm is provided as free, open source code. See the parallel_merge_sort_hybrid_radix function in the ParallelMergeSort.h file.

Other Data Distributions

Results for two other input data distributions are shown below: nearly pre-sorted (first table) and constant (second table). The units are again millions of integers per second.

Algorithm	Performance	Parallel	In-Place	Stable
sort	28	no	yes	no
stable_sort	49	no	maybe	yes
sort	262	yes	yes	no
stable_sort	442	yes	maybe	yes
lsd radix sort	76	no	no	yes
merge radix sort	1303	yes	no	yes
lsd radix sort	1489	yes	no	yes

Algorithm	Performance	Parallel	In-Place	Stable
sort	1451	no	yes	no
stable_sort	62	no	maybe	yes
sort	1460	yes	yes	no
stable_sort	549	yes	maybe	yes
lsd radix sort	66	no	no	yes
merge radix sort	1386	yes	no	yes
lsd radix sort	1550	yes	no	yes

For nearly pre-sorted arrays, Standard C++ Parallel Sort accelerates by 9X, and Stable Sort by 9X. Parallel Merge Radix Sort outperforms Standard C++ Parallel Sort by 5X, while Parallel LSD Radix Sort outperforms it by 5.7X. Both outperform parallel Stable Sort by about 3X.

For arrays filled with a constant value, Parallel Merge Radix Sort lags slightly behind Standard C++ Parallel Sort, while Parallel LSD Radix Sort slightly outperforms.

Both Parallel Merge Radix Sort and especially Parallel LSD Radix Sort provide more consistent performance across input data distributions than the Standard C++ sorts.
