In my previous blog https://duvanenko.tech.blog/2020/02/03/faster-c-sorting/ I benchmarked standard C++ Sort algorithm on a single core at 11 Million 32-bit integers per second. Parallel C++ Sort runs at 93 Million integers per second on a 48-core Xeon CPU on AWS. I also benchmarked Parallel Merge Sort on the same machine, which reaches over 600 Million integers per second. This algorithm is not in-place and scales much better than the standard C++ Parallel Sort algorithm. with the number of processor cores.

In this blog I’ll discuss a hybrid parallel algorithm that reaches a Billion 32-bit integers per second on the same 48-core AWS machine. This level of performance is nearly two orders of magnitude (i.e. nearly 100X) that of standard C++ Sort algorithm running on a single core.

## Parallel Hybrid Algorithm

Parallel Merge Sort algorithm presented before is a hybrid algorithm. It uses a recursive divide-and-conquer approach to create a Parallel Merge Sort. It also uses a Parallel Merge algorithm as well, with Insertion Sort at the leaf nodes of the divide-and-conquer implementation. This technique performs really well and is generic – i.e. works for any data type which supports a comparison. It scales well with more cores and more memory bandwidth, and is very memory access pattern friendly, because Merge accesses memory sequentially.

Another approach is to use a different algorithm at the leaf node of the Merge Sort divide-and-conquer recursive tree, such as the LSD Radix Sort. In a previous blog (https://duvanenko.tech.blog/2019/02/27/lsd-radix-sort-performance-improvements/), I described a novel approach to the LSD Radix Sort, which splits the LSD Radix Sort into two phases: the counting phase, and the permutation phase. This approach is faster than the traditional approach where for each digit, a counting phase and a permutation phase are performed. The new approach reduced the number of passes performed over the input data array, nearly in half.

This hybrid combination of Parallel Merge Sort with LSD Radix Sort, performs significantly faster, when each leaf node is given N/M number of elements, where N is the number of elements in the array and M is the number of processor cores.

## Less Generic

Increased performance comes at a price of the LSD Radix Sort being less generic than Merge Sort, where supporting various data types is harder to accomplish. In the HPCSharp C# nuget package, I have implemented an MSD Radix Sort for variety of numeric data types. These techniques can be applied to the LSD Radix Sort as well to support sorting on keys of any numerical data type.

## Availability

The initial implementation of this high performance algorithm is provided as open source and is free (https://github.com/DragonSpit/ParallelAlgorithms). See the parallel_merge_sort_hybrid_radix function in the ParallelMergeSort.h file.

Support for keys of other numeric data types will be available, but will not be free. Please contact me for pricing and availability.