I recently ran across a research paper on Radix Selection in which the authors mentioned that LSD Radix Sort is known to have performance issues when sorting arrays with certain data distributions. They did not elaborate on what those distributions are. Professor Sedgewick has also noted that the worst case for Radix Sort is an input array filled with a constant value – i.e. every array element has the same value. In this blog, I’ll explore several problematic input data distributions for LSD Radix Sort, show which parts of the algorithm struggle, ponder potential reasons, and explore solutions along with their performance gains.
Problematic Input Data Distributions in C++
The following table shows the performance of LSD Radix Sort running on a single core of an Intel Core Ultra 9 275HX processor, sorting 1 billion 32-bit unsigned integers:
| Input Distribution | Sort Time |
| --- | --- |
| Random Data | 5.59 seconds |
| Presorted Data | 5.95 seconds |
| Constant Data | 10.1 seconds |
It looks like there is a small problem with presorted data and a big performance issue with arrays filled with the same value (constant data) – nearly a 2X performance decline.
Digging Deeper
The LSD Radix Sort implementation used here is a two-phase algorithm: the first phase performs a single counting/histogram pass over the array to build counting arrays for all the digits of the keys, and the second phase performs as many permutation passes as there are digits in the keys. For 32-bit unsigned integer keys, the algorithm performs a single counting pass, followed by 4 permutation passes over the input array.
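To make this structure concrete, here is a minimal sketch of such a two-phase LSD Radix Sort, assuming 8-bit digits (256 bins) for 32-bit unsigned keys. The function name and the details are illustrative, not the measured implementation:

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal sketch of the two-phase structure: one counting pass builds a
// 256-entry histogram for each of the four 8-bit digits, then four stable
// permutation passes ping-pong the data between two buffers.
void lsd_radix_sort_sketch(std::vector<uint32_t>& a)
{
    const size_t n = a.size();
    if (n == 0) return;
    std::vector<uint32_t> buffer(n);

    // Phase 1: a single pass over the input counts all digits of every key.
    size_t count[4][256] = {};
    for (size_t i = 0; i < n; i++)
        for (unsigned d = 0; d < 4; d++)
            count[d][(a[i] >> (8 * d)) & 0xFF]++;

    // Phase 2: one permutation pass per digit.
    uint32_t* src = a.data();
    uint32_t* dst = buffer.data();
    for (unsigned d = 0; d < 4; d++)
    {
        // Exclusive prefix sum turns counts into each bin's starting write position.
        size_t endOfBin[256];
        size_t sum = 0;
        for (unsigned b = 0; b < 256; b++) { endOfBin[b] = sum; sum += count[d][b]; }

        for (size_t i = 0; i < n; i++)
        {
            unsigned digit = (src[i] >> (8 * d)) & 0xFF;
            dst[endOfBin[digit]++] = src[i];
        }
        std::swap(src, dst);
    }
    // After an even number of passes (four), the sorted result is back in 'a'.
}

The inner loop of each permutation pass contains the “endOfBin[digit]++” operation that the rest of this post focuses on.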
The following table shows performance for each of these steps:
| Step | Time |
| --- | --- |
| Counting/Histogram of Random Data | 0.924 seconds |
| Permutation of Random Data | 1.1 seconds |
| Counting/Histogram of Presorted Data | 1.5 seconds |
| Permutation of Presorted Data | 1.2 seconds |
| Counting/Histogram of Constant Data | 1.6 seconds |
| Permutation of Constant Data | 2.1 seconds |
The above measurements show that something is going on in the permutation code. There is some sort of sensitivity to certain data distributions, especially in the case of constant data, resulting in a nearly 2X slowdown. The Counting/Histogram step also slows down by over 50% for the presorted and constant data cases.
The Problem Code
Interestingly, both the Counting/Histogram and permutation steps perform a similar type of operation:
array[index]++
In both cases, the index is the current digit of the key value, and in both a tight for loop iterates over the array, executing “array[index]++” for every array element. For Counting/Histogram, array elements are read, while for permutation, array elements are written.
When the input array consists of constant values, the index will be a constant value, which means that the same location within the array is being incremented. This creates a loop-carried dependency, where one increment operation must finish before the next one can start.
What makes this even worse is that it is a memory operation – an element of an array is being incremented. When the index is constant (as it is for constant input arrays), the same array/memory location is incremented over and over, which is slower than incrementing a register.
An Improvement
For permutation, a possible improvement is to hold “array[index]” in a register and increment that register, instead of a memory location, while the index stays constant. When the index is detected to have changed, “array[old_index]” is updated from the register value and the register switches to the new “array[new_index]”.
For LSD Radix Sort this barely slowed down the random and presorted data cases (by about 0.1 seconds each), but improved the constant data case by over 4X, making it the fastest case instead of the slowest.
| Step (improved) | Time |
| --- | --- |
| Permutation of Random Data | 1.2 seconds |
| Permutation of Presorted Data | 1.3 seconds |
| Permutation of Constant Data | 0.5 seconds |
Improved Code
The following two code snippets show the before and after versions of the permute part of the LSD Radix Sort. The first snippet is the original code, with the issue of incrementing the “endOfBin[digit]++” array/memory location:
for (size_t _current = 0; _current <= last; _current++)
{
    unsigned digit = (_input_array[_current] & bitMask) >> shiftRightAmount;
    _output_array[endOfBin[digit]++] = _input_array[_current];   // increments a memory location every iteration
}
The second snippet contains more lines of code, but performs similarly for the random and presorted distributions and over 4X faster for the constant distribution:
unsigned prev_digit = (_input_array[0] >> shiftRightAmount) & bit_mask;
size_t index = endOfBin[prev_digit];          // keep the current bin's write position in a register
_output_array[index++] = _input_array[0];
for (size_t _current = 1; _current <= last; _current++)
{
    unsigned digit = (_input_array[_current] >> shiftRightAmount) & bit_mask;
    if (digit != prev_digit)
    {
        endOfBin[prev_digit] = index;         // write the register back only when the digit changes
        index = endOfBin[digit];              // switch to the new bin's write position
        prev_digit = digit;
    }
    _output_array[index++] = _input_array[_current];   // increments a register, not memory
}
// if endOfBin is needed after this loop, write the register back: endOfBin[prev_digit] = index;
This code snippet uses “index” as a register that is incremented, instead of incrementing an array/memory element, resulting in a speedup of the worst case while barely affecting the other cases. The worst case is when all array elements have a constant value, which makes “digit” the same value on every iteration, incrementing the same array element – i.e. “endOfBin[same_digit]++”.
Also, an interesting observation about this optimization: the problem was not with system memory accesses, but with the array/memory increment, which should be served from CPU cache. For constant arrays, system memory access is sequential for both the source and destination arrays, which is optimal. Yet “endOfBin[same_digit]++” is still the performance bottleneck.
Another Case
In the Counting/Histogram case, the input array is read and the count array is filled in using:
for (size_t _current = 0; _current <= last; _current++)
{
    unsigned digit = (_input_array[_current] & bitMask) >> shiftRightAmount;
    count[digit]++;
}
In this case, when the input array is filled with constant values, the same counting array location is incremented over and over. This is a loop-carried dependency once again.
Removing the dependency is simpler in this case: create one or more copies of the counting array and alternate which copy gets incremented. Once the input array has been processed, the multiple copies are combined into a single counting array. On modern processors, up to 4 copies of the count[] array produce a speedup, as sketched below.
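Here is a minimal sketch of that idea, assuming 8-bit digits (256 bins) and four copies of the counting array. It reuses the names from the counting snippet above, wrapped in a hypothetical function with assumed parameter types; the actual implementation may differ:

// Sketch: counting pass with four partial histograms to break the
// count[digit]++ dependency. Assumes 8-bit digits (256 bins); 'last' is the
// index of the last element, as in the loops above. Illustrative only.
void count_with_four_copies(const unsigned* _input_array, size_t last,
                            unsigned bitMask, unsigned shiftRightAmount,
                            size_t count[256])
{
    size_t count0[256] = {}, count1[256] = {}, count2[256] = {}, count3[256] = {};

    size_t _current = 0;
    for (; _current + 3 <= last; _current += 4)      // 4 independent increments per iteration
    {
        count0[(_input_array[_current    ] & bitMask) >> shiftRightAmount]++;
        count1[(_input_array[_current + 1] & bitMask) >> shiftRightAmount]++;
        count2[(_input_array[_current + 2] & bitMask) >> shiftRightAmount]++;
        count3[(_input_array[_current + 3] & bitMask) >> shiftRightAmount]++;
    }
    for (; _current <= last; _current++)             // 0-3 leftover elements
        count0[(_input_array[_current] & bitMask) >> shiftRightAmount]++;

    // Combine the four partial histograms into the single count[] array.
    for (size_t i = 0; i < 256; i++)
        count[i] = count0[i] + count1[i] + count2[i] + count3[i];
}

Because each of the four increments in an iteration targets a different array, they are independent of one another, so the CPU can overlap them instead of serializing on a single memory location.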
Conclusion
Sometimes it is worth bringing up the performance of the worst-case distribution, while sacrificing the performance of other distributions slightly, to provide more consistent overall performance. Plus, when evaluating algorithms, worst-case performance is one of the crucial attributes.