The LSD Radix Sort algorithm has been around for a long time: it was first used in computing in the 1950s and 1960s to sort punch cards by making several passes over the stack of cards. It has resurfaced in the last decade as GPUs, with hundreds or even thousands of processors and local memories that support truly random access, have been used to accelerate sorting.

In this blog post I’ll describe three optimizations that improve the performance of LSD Radix Sort. All of them have been implemented in the C# HPCsharp NuGet package/library, which is open source and free on nuget.org.

### Ping Pong Between Two Buffers

LSD Radix Sort is not an in-place algorithm. It requires each phase of the process to be a stable sort. As each digit is processed from the least significant digit to the most significant digit, the input array is permuted into a separate output array. Common implementations of the LSD Radix Sort, such as the one in Prof. Sedgewick’s “Algorithms in C++” on p. 440, and the implementation in HPCsharp, both allocate an additional array of the same size as the input array to help during processing.

The implementation in Sedgewick’s book always performs the permutation from the input array to this additional array, followed by a copy back to the input array. This way the input array has the sorted result after each iteration, and at the end.

The implementation in HPCsharp removes these copies by ping-ponging between the input array and the additional array. For instance, during the first permutation pass, using the least significant digit, each element is moved from the input array to the additional array. During the second pass, each element is moved from the additional array back to the input array, and so on, alternating on each pass until all digits have been processed.
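A minimal sketch of the ping-pong pattern (this is illustrative code, not the HPCsharp source; the `PermuteByDigit` helper is hypothetical): after each stable permutation pass, the two buffers swap roles instead of the output being copied back into the input.

```csharp
using System;

public static class PingPongSketch
{
    // Sorts by the low 16 bits using two 8-bit digit passes.
    public static ulong[] Sort(ulong[] input)
    {
        ulong[] src = (ulong[])input.Clone();
        ulong[] dst = new ulong[src.Length];
        for (int d = 0; d < 2; d++)               // one pass per 8-bit digit
        {
            PermuteByDigit(src, dst, d);          // stable pass: src -> dst
            var tmp = src; src = dst; dst = tmp;  // swap roles; no copy-back
        }
        return src;  // after an even number of passes the result is in 'src'
    }

    // Hypothetical helper: one stable counting-sort pass on 8-bit digit 'd'.
    static void PermuteByDigit(ulong[] input, ulong[] output, int d)
    {
        int shift = d * 8;
        var start = new int[257];
        foreach (var x in input) start[((x >> shift) & 0xFF) + 1]++;  // count
        for (int i = 1; i < 257; i++) start[i] += start[i - 1];       // prefix sum
        foreach (var x in input) output[start[(x >> shift) & 0xFF]++] = x;
    }
}
```

Note that Sedgewick’s copy-back and the swap above do the same number of permutation passes; the swap simply eliminates the extra O(N) copy after each one.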

### Count In a Single Pass

LSD Radix Sort performs a pass over the entire array for each digit. Most implementations perform a count using the current digit to determine the size of each bin, followed by the permutation pass. This is two passes over the array per digit: one to count and the other to permute. When sorting an array of 64-bit integers using 8-bit digits, that would take 16 passes over the input memory array.

Since the LSD Radix Sort performs a pass over the entire input array for each digit, and computes the size of bins for that digit, we could combine the counting passes for each digit into a single pass that is performed before all of the permutation passes. In other words, we could do a single counting pass over the input array instead of 8 passes as above.

The reason a single counting pass is possible is precisely that LSD Radix Sort always processes the entire array: the sizes of the bins computed for each digit do not change after each permutation step. For example, the array

7, 8, 10, 3, 3, 7, 10, 8, 7, 7, 3, 10

has three 3’s, four 7’s, two 8’s, and three 10’s. The sorted array

3, 3, 3, 7, 7, 7, 7, 8, 8, 10, 10, 10

also has the same number of 3’s, 7’s, 8’s, and 10’s even though the array has been permuted – i.e. elements have been moved.

Thus, all of the digits of each array element can be pre-processed (counted) in a single pass before proceeding with the permutation passes. This implementation is in the HPCsharp NuGet package/library. This optimization made a significant performance difference for LSD Radix Sort, and turned the counting operation into a histogram operation: instead of a 1-D count array, a 2-D count array is produced, with the first dimension being the digit (8 digits for an array of 64-bit elements) with which the bin counts are associated.
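A sketch of what such a single-pass, all-digits histogram can look like (the name `HistogramByteComponents` and its signature are taken from the implementation shown later in this post; the body here is my own illustration, not the HPCsharp source): each element is read once, and all 8 of its byte-digits are counted into the 2-D array.

```csharp
using System;

public static class HistogramSketch
{
    public static uint[][] HistogramByteComponents(ulong[] a, int l, int r)
    {
        const int digits = 8, bins = 256;       // 8-bit digits of a 64-bit value
        var count = new uint[digits][];
        for (int d = 0; d < digits; d++) count[d] = new uint[bins];
        for (int i = l; i <= r; i++)            // single pass over the input
        {
            ulong v = a[i];
            for (int d = 0; d < digits; d++)    // count all 8 digits of this element
                count[d][(v >> (d * 8)) & 0xFF]++;
        }
        return count;
    }
}
```

For each digit `d`, the counts in `count[d]` sum to the number of elements, and that row stays valid for every later permutation pass, since permuting never changes bin sizes.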

### Parallel Counting

The counting portion of LSD Radix Sort (and even of MSD Radix Sort) can be easily parallelized to run on multiple cores, and it accelerates very well because it only reads array elements and produces a small result: O(N) reads and O(1) output.

Whether counting is performed on a single digit of each element, or on all of the digits at once as in the histogram above, parallelization across multiple cores performs wonderfully well, scaling up to the bandwidth of CPU memory. Sadly, data-parallel (SIMD/SSE) instructions cannot be used, because they do not support the indexed operations needed for counting.

HPCsharp has a parallel version of the LSD Radix Sort, which parallelizes the counting/Histogram part of the overall algorithm. Parallelization of the rest of the LSD Radix Sort algorithm is in the works…
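A sketch of parallel counting along the lines described above (the name `HistogramPar` and this body are my own illustration; HPCsharp’s actual `HistogramByteComponentsPar` may be structured differently): each worker counts its own sub-range into a private histogram, and the small 8×256 partial histograms are merged at the end, so the hot counting loop is entirely contention-free.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ParallelHistogramSketch
{
    public static uint[][] HistogramPar(ulong[] a)
    {
        const int digits = 8, bins = 256;
        var total = new uint[digits][];
        for (int d = 0; d < digits; d++) total[d] = new uint[bins];
        object mergeLock = new object();
        Parallel.ForEach(Partitioner.Create(0, a.Length), range =>
        {
            // each worker counts its own sub-range into a private histogram
            var local = new uint[digits][];
            for (int d = 0; d < digits; d++) local[d] = new uint[bins];
            for (int i = range.Item1; i < range.Item2; i++)
                for (int d = 0; d < digits; d++)
                    local[d][(a[i] >> (d * 8)) & 0xFF]++;
            lock (mergeLock)                    // cheap merge: 8 x 256 additions
                for (int d = 0; d < digits; d++)
                    for (int b = 0; b < bins; b++)
                        total[d][b] += local[d][b];
        });
        return total;
    }
}
```

The per-worker private histogram is the key design choice: sharing one count array across cores would require atomic increments on every element, which would destroy the memory-bandwidth scaling the post describes.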

### LSD Radix Sort Implementation

Below is the top-level LSD Radix Sort implementation, which includes the first two performance optimizations described above:

```csharp
public static ulong[] SortRadix(this ulong[] inputArray)
{
    const int bitsPerDigit = 8;
    uint numberOfBins = 1 << bitsPerDigit;
    uint numberOfDigits = (sizeof(ulong) * 8 + bitsPerDigit - 1) / bitsPerDigit;
    int d = 0;
    var outputArray = new ulong[inputArray.Length];

    uint[][] startOfBin = new uint[numberOfDigits][];
    for (int i = 0; i < numberOfDigits; i++)
        startOfBin[i] = new uint[numberOfBins];
    bool outputArrayHasResult = false;

    ulong bitMask = numberOfBins - 1;
    int shiftRightAmount = 0;

    // single counting pass: histogram all digits of all elements at once
    uint[][] count = HistogramByteComponents(inputArray, 0, inputArray.Length - 1);

    // turn the counts for each digit into starting indexes for each bin
    for (d = 0; d < numberOfDigits; d++)
    {
        startOfBin[d][0] = 0;
        for (uint i = 1; i < numberOfBins; i++)
            startOfBin[d][i] = startOfBin[d][i - 1] + count[d][i - 1];
    }

    d = 0;
    while (bitMask != 0)    // end processing digits when all the mask bits have been processed and shifted out, leaving no bits set in the bitMask
    {
        uint[] startOfBinLoc = startOfBin[d];

        // permutation pass for the current digit
        for (uint current = 0; current < inputArray.Length; current++)
            outputArray[startOfBinLoc[(inputArray[current] & bitMask) >> shiftRightAmount]++] = inputArray[current];

        bitMask <<= bitsPerDigit;
        shiftRightAmount += bitsPerDigit;
        outputArrayHasResult = !outputArrayHasResult;
        d++;

        ulong[] tmp = inputArray;    // swap input and output arrays
        inputArray = outputArray;
        outputArray = tmp;
    }
    return outputArrayHasResult ? outputArray : inputArray;
}
```

Notice that in the above LSD Radix Sort implementation, the main while loop performs only the permutation of array elements – i.e. moving them. The histogram/counting portion has been moved outside of this permutation loop. Also, the input and output arrays are swapped after each permutation pass inside the while loop, eliminating copying. At the end, whichever array holds the sorted result is returned, once again avoiding a copy. The parallel version of counting/histogram is not shown in the above implementation, but in the HPCsharp library it has the same interface and simply extends the name ‘HistogramByteComponents’ to ‘HistogramByteComponentsPar’, where ‘Par’ is short for ‘Parallel’.

If you want to see the complete implementation, see the RadixSort.cs source file in the repository at https://github.com/DragonSpit/HPCsharp.