I’ve made several attempts at parallelizing the LSD Radix Sort algorithm. This is a conceptual description of the latest attempt, which will hopefully work out well. My latest implementation, a partially parallel version of LSD Radix Sort, is performing very well, running at around 150 MegaInt32/sec in C++ and at nearly the same speed in C# (in the HPCsharp NuGet package). This version splits the algorithm into a counting phase and a permutation phase, where the counting phase uses multiple cores to gain higher performance, while the permutation phase runs on a single core. This blog describes an attempt at using multiple cores for the permutation phase of the algorithm as well.
LSD Radix Sort Synopsis
The counting phase of LSD Radix Sort reads each element of the input array sequentially. For each digit of each element, a histogram is generated, with a count of the occurrences of each digit value. For example, if 8-bit digits are used, then 256 bins are created, counting the occurrences of value 0, value 1, …, value 255. Once the size of each bin has been established, it is known where to place each input value in the output array, since we know where each bin will start within the output array.
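The counting step above can be sketched as follows. This is my own minimal illustration, not the HPCsharp or C++ implementation mentioned earlier; the function name CountDigit and its signature are assumptions for the sake of the example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count occurrences of one 8-bit digit across the input array.
// 'shift' selects the digit: 0 for the lowest byte, 8 for the next, and so on.
std::array<size_t, 256> CountDigit(const std::vector<uint32_t>& input, unsigned shift)
{
    std::array<size_t, 256> count{};        // 256 bins, zero-initialized
    for (uint32_t value : input)
        ++count[(value >> shift) & 0xFF];   // extract the digit, bump its bin
    return count;
}
```

An exclusive prefix sum over these 256 counts then yields the starting index of each bin in the output array.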
The permutation phase of the LSD Radix Sort algorithm copies each value from the input array into the output array, placing each element into its bin (location within the output array). Several permutation passes are performed – one pass per digit. For instance, if the array elements are 32-bit integers and we are using 8-bit digits, then four passes will be performed. For 64-bit elements with 8-bit digits, eight permutation passes will be performed.
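A single-core permutation pass – the part this post sets out to parallelize – might look like the sketch below. Again, PermutePass and its signature are my own assumptions for illustration; the starting-index array is the exclusive prefix sum of the digit counts from the counting phase, and it is advanced in place as elements are scattered.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One permutation pass: scatter each element of 'input' into its bin within
// 'output'. 'startIndex[d]' must initially hold the starting offset of bin d
// (an exclusive prefix sum of the digit counts); each placement advances it.
void PermutePass(const std::vector<uint32_t>& input, std::vector<uint32_t>& output,
                 std::array<size_t, 256>& startIndex, unsigned shift)
{
    for (uint32_t value : input)
        output[startIndex[(value >> shift) & 0xFF]++] = value;
}
```

Because each placement depends on the current value of a shared startIndex entry, this loop is inherently sequential as written – which is exactly why the rest of the post looks for a way to split it across cores.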
Parallel Permutation Ideas
To perform the permutation phase of the algorithm in parallel, we could split the input array into chunks of work – work quanta. Each worker (e.g. CPU core, or task) would pick up a work quantum – a section of the input array – and would permute the array elements within just that section. This requires each work quantum to have enough information associated with it to accomplish this permutation work. The necessary information is where each of the 256 bins starts for that work quantum. Thus, we need a 2-D array, where the upper dimension is the work quantum and the lower dimension is the 256 bins of starting indexes.
To compute the starting index of each bin, for each work quantum, a more complex counting (histogram) method is needed. Counts for each work quantum, for each bin, are needed. These can be computed by performing a single pass over the input array, which is split into work-quantum chunks, computing the per-bin counts for each of these chunks. This determines, per chunk, how many array elements have a current digit of 0, how many have a current digit of 1, and so on, up to a digit of 255 (for 8-bit digits).
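Since each chunk writes only to its own row of the 2-D counts array, the chunks can be counted concurrently. The sketch below shows one way to do this with plain std::thread; the function name CountPerQuantum and the thread-per-chunk scheme are my own assumptions, not the author's implementation.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Per-work-quantum histogram: counts[q][d] = number of elements in chunk q
// whose current digit equals d. Each chunk is counted by its own thread;
// the threads write to disjoint rows, so no synchronization is needed.
std::vector<std::array<size_t, 256>>
CountPerQuantum(const std::vector<uint32_t>& input, size_t numQuanta, unsigned shift)
{
    std::vector<std::array<size_t, 256>> counts(numQuanta);   // zero-initialized rows
    size_t chunkSize = (input.size() + numQuanta - 1) / numQuanta;
    std::vector<std::thread> workers;
    for (size_t q = 0; q < numQuanta; ++q)
        workers.emplace_back([&, q] {
            size_t begin = q * chunkSize;
            size_t end   = std::min(begin + chunkSize, input.size());
            for (size_t i = begin; i < end; ++i)
                ++counts[q][(input[i] >> shift) & 0xFF];
        });
    for (auto& w : workers)
        w.join();
    return counts;
}
```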
Transform Histogram into Starting Indexes
Once we know the counts for each bin, for each work quantum, we can determine the starting indexes for each bin, for each work quantum. For bin 0, the starting index of work quantum 0 is zero. The starting index for work quantum 1 is the starting index of work quantum 0 plus the count for bin 0 of work quantum 0. This pattern continues for work quantum 2 and so on: each starting index is the previous work quantum’s starting index plus the previous work quantum’s count for the same bin. In other words, within each bin we compute an exclusive prefix sum of the counts over the work quanta.
At this point, within each bin, the work quanta starting indexes account for the counts within that bin, but not for the bins before it – the starting index of work quantum 0 is zero for every bin, which is wrong. We need to compute the overall size of bin 0 and adjust the bin 1 starting indexes by adding that amount. To compute the size of bin 0, we can take the bin 0 starting index of the last work quantum and add the last work quantum’s count for bin 0. Then all of the bin 1 starting indexes need to be adjusted by adding this value to each of them. The same adjustment continues through the remaining bins: each bin’s starting indexes are offset by the cumulative size of all bins before it.
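The two steps – the exclusive prefix sum down each bin, followed by the cumulative bin-size adjustment – can be fused into a single pass over the 2-D counts, as in this sketch (StartingIndexes is a name I’ve made up for illustration):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Turn the 2-D counts (work quanta x 256 bins) into 2-D starting indexes.
// For each bin, an exclusive prefix sum over the work quanta gives each
// quantum's offset inside that bin; carrying the running total forward into
// the next bin makes every bin start where the previous one ends.
std::vector<std::array<size_t, 256>>
StartingIndexes(const std::vector<std::array<size_t, 256>>& counts)
{
    size_t numQuanta = counts.size();
    std::vector<std::array<size_t, 256>> start(numQuanta);
    size_t offset = 0;                            // running total across bins and quanta
    for (size_t d = 0; d < 256; ++d)              // bins, in order
        for (size_t q = 0; q < numQuanta; ++q)    // exclusive prefix sum down the bin
        {
            start[q][d] = offset;
            offset += counts[q][d];
        }
    return start;
}
```

With these starting indexes in hand, each worker can run its permutation pass over its own chunk independently, since every (quantum, bin) pair scatters into its own disjoint slice of the output array.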
The above descriptions need pictures to make it clearer!