Fastest LSD Radix Sort in C++ on a Single CPU Core

I’ve been optimizing variety of Radix algorithms for over a decade: LSD and MSD Radix Sort, Single Core and Multi-Core, and Radix Selection. Several of these optimization techniques can be applied to all of these algorithms. In this blog, I’ll discuss each of these optimizations and show how much performance is gained, for sequential single core C++ implementation.

The following Table shows performance of LSD Radix Sort with each optimization when sorting and array of 1 billion unsigned 32-bit integers on a single core of Intel Core Ultra 9 275HX laptop computer. The units are millions of unsigned 32-bit integers per second.

OptimizationRandomPresortedConstant
Baseline16617174
Two Phase177165104
Two Phase & Derandomization203189110
Two Phase & Constant Optimization180161278
Two Phase & Derandomization & Constant Optimization205191240

With all optimization, the LSD Radix Sort algorithm saw a speed up of 23% for random data, 12% for presorted data, and 324% for constant data. Performance variation across different input data distributions has been reduced from 230% for the Baseline implementation, to 25% for the optimized version. Worst case performance, constant array for the Baseline versus presorted for the optimized, improved by 258%.

These implementation are available in this open source repository. Let’s discuss each of these optimizations in more detail.

For reference, C++ std::sort() performs as follows for the same data distributions:

AlgorithmRandomPresortedConstant
std::sort()13524081
std::stable_sort()137495

Keep in mind that the LSD Radix Sort is a stable sorting algorithm.

Two Phase

This optimization was developed a few years ago and described in my Faster LSD Radix Sort blog. It splits counting and permutation into separate phases, with counting performed only once, while performing permutation per digit. It works from the realization that counting for all digits can be done in a single pass over the array before any of the permutations. This works because for LSD Radix Sort counting is always over the entire array, and counts will not change because array elements are permuted.

Counting is a faster operation versus permutation, since during counting the input array is read sequentially from system memory which is optimal for memory bandwidth.

Derandomization

During the permutation phase of any Radix Sort algorithm (LSD or MSD), each input array element is written to the bin where it belongs. When 8-bits/digit is used, resulting in 256 bins. These bins will be written in data-dependent order. When data is random, then each of the 256 bins will be written in random order. Current computer system memory architecture performs well for sequential accesses and performs poorly (as much as 300X slower) for random accesses.

To improve memory access pattern during permutation, a small buffer for each bin can be used to collect multiple elements in CPU cache before writing them into system memory. Once a particular bin’s buffer is full, that buffer is copied into system memory sequentially. Another benefit is that copies from the buffer to bins in system memory are performed using SSE instructions.

Allocating buffers for all bins as a contiguous array helps these buffers stay in CPU cache and not interfere with each other. This optimization performs more memory writes, while being more efficient.

Constant Optimization

In a recent blog, I introduced a new optimization which improves performance for constant arrays. This performance issue arises because a memory location is being incremented over and over again in a tight loop, causing a loop dependency. Instead, a register is used to hold the array index, instead of holding the index in memory. Incrementing a register is significantly faster than incrementing a memory location.

Conclusion

Several performance optimization were discussed for the single core LSD Radix Sort. Some of these also apply to other Radix algorithms, such as MSD Radix Sort and Radix Selection. Performance across input data distributions was improved significantly, along with increasing performance of the worst case distribution. Some of these ideas should also apply to parallel Radix algorithms.

Leave a comment