I’ve taken several attempts at parallelizing the LSD Radix Sort algorithm. This blog provides the key concepts and details about the latest attempt, which has succeeded and is beginning to pay off dividends of higher performance.
The following table shows performance of three variations of the LSD Radix Sort on two different multi-core machines, in Millions of UInt32’s per second, implemented in C#
|LSD Radix Sort||104||45||90||1-core i7-9750H|
|LSD Radix Sort (Partially Parallel)||118||48||112||6-core i7-9750H|
|LSD Radix Sort (Parallel)||195||203||225||6-core i7-9750H|
|LSD Radix Sort (Parallel)||316||360||370||48-core AWS|
C++ implementation of the serial and partially parallel versions perform 25% faster. Currently, Parallel Merge Sort reaches higher performance on the 48-core processor (see https://github.com/DragonSpit/HPCsharp#Merge-Sort), for now.
Partially Parallel Version
Currently, my HPCsharp open source C# Nuget package, available on nuget.org, contains a partially parallel implementation of the LSD Radix Sort algorithm. It is performing well, running at 120 Million Int32/sec on a 6-core laptop. An equivalent C++ implementation, available in a git repo https://github.com/DragonSpit/ParallelAlgorithms runs at nearly 150 MegaInt32/sec.
This version splits the algorithm into two phases – a counting phase and a permutation phase – where the counting phase uses multiple cores to gain 25% higher performance. This idea is described in my blog https://duvanenko.tech.blog/2019/02/27/lsd-radix-sort-performance-improvements/. However, the permutation phase runs on a single core, making the overall LSD Radix Sort algorithm only partially parallel.
To parallelize more of the LSD Radix Sort algorithm, the permutation phase needs to be parallelized. To accomplish this, we split the input array into several sub-arrays. Each worker (e.g. CPU core) permutes elements of a sub-array it is given. All available workers do this in parallel.
To be able to permute each sub-array correctly, additional information needs to be provided along with the sub-array input elements. For each of the 256 bins, we need to know where to place elements of each sub-array as we permute them. In other words, the start index within each bin is needed, for each sub-array. This is simple to do, during the counting phase of the LSD Radix Sort algorithm. As we count elements of the input array, instead of counting the entire input array, we count elements of each sub-array.
From the counts of each sub-array, we can create the additional information that we need for each sub-array – where to place the elements of that sub-array within each bin. In other words, instead of the start index for each bin, we determine the start index for each sub-array within each bin. This requires a 2-D array, where the upper dimension specifies the sub-array index, and the lower dimension holds the starting indexes for each of the 256 bins. This way each sub-array knows where its elements need to go within each bin, when it permutes them.
De-randomization of Writes
The permutation phase of the algorithm is the most time consuming portion of the LSD Radix Sort. During this phase, the input array is read sequentially, which is efficient. However, the writes are to one of the 256 bins, potentially in random order. This order is dependent on the input data itself. These possibly random writes, are more than 10X slower than sequential writes. This performance drop was shown in my blog on memory accesses https://duvanenko.tech.blog/2020/03/07/memory-access/. Truly random writes were shown to be 350X slower than sequential writes.
Random write accesses put significant stress on modern system memory sub-system, reducing the overall available bandwidth, reducing the ability for many cores to write in parallel. This reduction in bandwidth, limits scaling of the algorithm on multi-core processors. For example, 27 GBytes/sec of write memory bandwidth is reduced to 2.7 GBytes/sec.
To improve write performance we can de-randomize writes, by adding buffering for each of the 256 bins. Each bin would have a buffer of some number of elements (e.g. 64). We would write to each of these buffers instead of writing directly to system memory. Once a particular buffer has filled up, it is written sequentially to system memory. This type of a buffering scheme helps parallel performance scale significantly.
One nice benefit of using de-randomization thru buffering, is that performance of the LSD Radix Sort algorithm gets more consistent for different input distributions. For example, arrays that are filled with random data perform very similarly to arrays filled with already presorted data, or constant value data. This performance improvement holds for parallel and serial algorithms. It drops performance for serial algorithm slightly for random input data, but brings up performance for presorted, or incrementing input data distribution.
The following table is for C++ implementations, not C#. C# implementations perform similarly, just a bit slower:
|LSD Radix Sort||Mega unsigned / sec||Distribution||Computer|
|Two-Phase Parallel||153||Random||6-core laptop|
|Two-Phase Parallel||140||Pre-Sorted||6-core laptop|
|Two-Phase Parallel||121||Constant||6-core laptop|
|Two-Phase Parallel De-rand||129||Random||6-core laptop|
|Two-Phase Parallel De-rand||105||Pre-Sorted||6-core laptop|
|Two-Phase Parallel De-rand||118||Constant||6-core laptop|
Two Phase Strategy Is Not Possible
Separating the LSD Radix Sort algorithm into the counting phase and the permutation phase, is a really good optimization to reduce the number of passes over the input array. It is described in my blog (https://duvanenko.tech.blog/2019/02/27/lsd-radix-sort-performance-improvements/). This idea works great when the entire input array will be processed. However, this idea does not work when processing sub-arrays.
When permuting the entire array at a time, while processing each digit, from the least significant digit to the most, the counts are not affected by permuting array elements. However, they do change, when permuting sub-arrays, because the input array elements will not stay within the bounds of a sub-array being permuted. As we permute a sub-array, its elements can move anywhere within the full array – i.e. they don’t stay within the bounds of that particular sub-array. Thus, we will need to perform counting after processing the input array for each digit. In other words, for parallel LSD Radix Sort, two phase strategy is not correct, and we need to use the old strategy of counting followed by permutation, for each digit.
Our overall processing becomes:
- outer loop is over the number of digits, starting with the least significant digit and moving toward the most significant digit
- within the loop, perform counting for that digit of each sub-array, across the entire input array
- compute starting index for each sub-array, for each bin
- process all sub-arrays in parallel
The resulting Parallel LSD Radix Sort has been parallelized more extensively. The two biggest computing parts of the algorithm have been parallelized. The counting phase for each digit has been parallelized. The permutation phase has also been parallelized during processing of each digit. Digits are still processed serially in a loop, from the least significant digit to the most significant digit.
Consistency of Performance
Performance of many sorting algorithms varies substantially across input data distributions. For example, quicksort and Introspective Sort perform about 6X slower for an input array filled with random data, versus a pre-sorted array. They also perform about 3X slower for an array filled with constant data.
Typical LSD Radix Sort algorithm performs better for random data, slowing down by over 2X for pre-sorted data, and slowing down by 20% for constant data. The above implementations of this algorithm were shown to perform much more consistently across input data distributions.
The parallel LSD Radix Sort performs more consistently across variety of input data distributions, with slightly faster performance for pre-sorted and constant input data distributions. De-randomization buffering of the write data is the reason for the nearly uniform performance across input distributions.
Where to Find
The fully parallel LSD Radix Sort algorithm has been implemented in C# inside HPCsharp nuget package on nuget.org, starting with version 3.15.0. The source code is open and is available in the git repo https://github.com/DragonSpit/HPCsharp inside RadixSortLsdParallel.cs. C++ version will also be available soon.
Transform Histogram into Starting Indexes
Once we know the counts for each bin, for each sub-array, we can determine starting indexes for each bin, for each sub-array. For bin 0, the starting index of sub-array 0 is zero. The starting index for sub-array 1, would be the starting index of sub-array 0 plus the count for bin 0 for sub-array 0. This pattern continues for starting index of sub-array 2 and so on, where its starting index is the starting index of the previous sub-array starting index plus the count for the same bin of previous sub-array.
At this point all within each bin, all of the sub-array’s starting indexes have been adjusted to account for the counts within each bin, but not to account for the bins before them – the starting index of sub-array 0 is zero for each bin – which is wrong. We need to compute the size of the overall bin 0 and adjust bin 1 starting indexes by adding that amount. To compute the size of bin 0, we can take the starting index of the last sub-array of bin 0 and add the count for bin 0 for the last sub-array. Then all of bin 1 starting indexes need to be adjusted by adding this value to each of them.
The above descriptions need pictures to make it clearer!