LSD Radix Sort algorithm has been around for a long time with first computer usage in 1950’s and 1960’s to sort punch cards by performing several passes at the stack of cards. It has resurfaced in the last decade with GPUs having hundreds and even thousands of processor with internal local truly random access memories to help accelerate sorting.

In this blog post I’ll describe three performance optimizations to help improve performance of LSD Radix Sort. All of these have been implemented in the C# HPCsharp NuGet package/library that is open source and free on nuget.org. These are also available in the C++ repo: https://github.com/DragonSpit/ParallelAlgorithms, also open source and free.

### Ping Pong Between Two Buffers

LSD Radix Sort is not an in-place algorithm. It requires each phase of the process to be a stable sort. As each digit is processed from the least significant digit to the most significant digit, the input array is permuted into a separate output array. Implementations of the LSD Radix Sort, such as the one in Prof. Sedgewick’s “Algorithms in C++” on p. 440, and the implementation in HPCsharp, both allocate an additional array of the same size as the input array to help during processing.

The implementation in Sedgewick’s book always performs the permutation from the input array to this additional array, followed by a copy back to the input array. This way the input array has the sorted result after each iteration, and at the end.

The implementation in HPCsharp removes the copies by ping-ponging between the input array and the additional array. For instance, during the first permutation pass using the least significant digit, each element is moved from the input array to the additional array. During the second pass, each element is moved from the additional array back to the input array, and so on alternating on each pass until all digits have been used to sort by.

### Count In a Single Pass

LSD Radix Sort performs a pass over the entire array for each digit. Most implementations perform a count using the current digit to determine the size of each bin, followed by  the permutation pass. This method requires two passes over the array per digit: one to count and the other to permute. When sorting an array of 64-bit integers using 8-bit digits, that would require 16 passes over the input memory array.

Since the LSD Radix Sort performs a pass over the entire input array for each digit, and computes the size of bins for that digit, we could combine the counting passes for each digit into a single pass that is performed before all of the permutation passes. In other words, we could do a single counting pass over the input array instead of 8 passes as above.

The reason it is possible to do a single pass is precisely because LSD Radix Sort always processes the entire array. In this case the size of the bins that are computed by each digit do not change in size after each permutation step. For example, given an array

7, 8, 10, 3, 3, 7, 10, 8, 7, 7, 3, 10

has three 3’s, four 7’s, two 8’s and three 10’s

3, 3, 3, 7, 7, 7, 7, 8, 8, 10, 10, 10

also has the same number of 3’s, 7’s, 8’s, and 10’s even though the array has been permuted – i.e. elements have been moved.

Thus, all of the digits of each array element can be pre-processed (counted) in a single pass before proceeding with the permutations passes. This implementation is in the HPCsharp NuGet package/library for C#, and is implemented in C++ in https://github.com/DragonSpit/ParallelAlgorithms. This optimization made a significant performance difference for LSD Radix Sort, and turned the counting operation into a Histogram operation. Now, instead of a 1-D counting array, a 2-D counting array is produced, with the first dimension being the digit (8 digits for an array of 64-bit elements) with which the counts for the bins are associated with.

This optimization becomes more significant for larger data types, as it reduces the number of passes over the input buffer even more.

### Parallel Counting

The counting portion of the LSD Radix Sort and even of the MSD Radix Sort, can be easily parallelized to run on multiple cores and accelerates very well, because it only performs array element reads and produces a few results. In other words, O(N) reads and O(1) output.

Whether counting is performed on a single digit of each element of the array, or on all the digits, such as the Histogram above, parallelization to multiple cores performs wonderfully well, scaling to the bandwidth of CPU memory. Sadly, using data-parallel (SIMD/SSE) is not possible because these data-parallel instructions do not support index operations, which are needed for counting.

HPCsharp has a parallel version of the LSD Radix Sort, which parallelizes the counting/Histogram part of the overall algorithm. Parallelization of the rest of the LSD Radix Sort algorithm has been implemented, is included in HPCsharp, with the details described in https://duvanenko.tech.blog/2020/02/17/parallel-lsd-radix-sort/ blog entry.

Below is the top-level LSD Radix Sort implementation, which includes the above two performance optimizations:

```public static ulong[] SortRadix(this ulong[] inputArray)
{
const int bitsPerDigit = 8;
uint numberOfBins = 1 << bitsPerDigit;
uint numberOfDigits = (sizeof(ulong) * 8 + bitsPerDigit - 1) / bitsPerDigit;
int d = 0;
var outputArray = new ulong[inputArray.Length];

uint[][] startOfBin = new uint[numberOfDigits][];
for (int i = 0; i < numberOfDigits; i++)
startOfBin[i] = new uint[numberOfBins];
bool outputArrayHasResult = false;

ulong bitMask = numberOfBins - 1;
int shiftRightAmount = 0;

uint[][] count = HistogramByteComponents(inputArray, 0, inputArray.Length - 1);

for (d = 0; d < numberOfDigits; d++)
{
startOfBin[d] = 0;
for (uint i = 1; i < numberOfBins; i++)
startOfBin[d][i] = startOfBin[d][i - 1] + count[d][i - 1];
}

d = 0;
while (bitMask != 0)    // end processing digits when all the mask bits have been processed and shifted out, leaving no bits set in the bitMask
{
uint[] startOfBinLoc = startOfBin[d];
for (uint current = 0; current < inputArray.Length; current++)
outputArray[startOfBinLoc[(inputArray[current] & bitMask) >> shiftRightAmount]++] = inputArray[current];

shiftRightAmount += bitsPerDigit;
outputArrayHasResult = !outputArrayHasResult;
d++;

ulong[] tmp = inputArray;       // swap input and output arrays
inputArray = outputArray;
outputArray = tmp;
}
return outputArrayHasResult ? outputArray : inputArray;
}```

Notice in the above LSD Radix Sort implementation, the main while loop performs only the permutation of the array elements – i.e. moving array elements. The Histogram/Counting portion has been moved outside of this permutation loop. Also, the input and output arrays are being swapped after each of the permutation passes inside the while loop, eliminating copying. At the end the array holding the sorted array is returned, avoiding a copy once again. Parallel version of Counting/Histogram is not shown in the above implementation, but in HPCsharp library it has the same interface and simply extends the name ‘HistogramByteComponents’ to ‘HistogramByteComponentsPar’, where ‘Par’ is short for ‘Parallel’.

If you want to see the complete implementation, see the repository in https://github.com/DragonSpit/HPCsharp, in RadixSort.cs source file.

### Performance in C++

The following table shows successively increasing performance of several implementations of the LSD Radix Sort:

 LSD Radix Sort Mega unsigned / sec Distribution Computer Base algorithm 111 Random 6-core laptop Base algorithm 104 Pre-Sorted 6-core laptop Base algorithm 72 Constant 6-core laptop Two-Phase 120 Random 6-core laptop Two-Phase 119 Pre-Sorted 6-core laptop Two-Phase 101 Constant 6-core laptop Two-Phase Parallel 153 Random 6-core laptop Two-Phase Parallel 140 Pre-Sorted 6-core laptop Two-Phase Parallel 121 Constant 6-core laptop

The above table shows how much the LSD Radix Sort implementation described in this blog improve performance, when running on a 6-core laptop with an i7 9750H Intel processor. C++ code is available in https://github.com/DragonSpit/ParallelAlgorithms – is open source and free. The fastest algorithm parallelizes the counting phase of the LSD Radix Sort algorithm, but not the permutation phase, as the C# implementation in HPCsharp does.