Fastest LSD Radix Sort in C++ on a Single CPU Core

I’ve been optimizing a variety of Radix algorithms for over a decade: LSD and MSD Radix Sort, single-core and multi-core, and Radix Selection. Several of the optimization techniques apply to all of these algorithms. In this blog, I’ll discuss each optimization and show how much performance it gains for a sequential, single-core C++ implementation.

The following table shows the performance of LSD Radix Sort with each optimization when sorting an array of 1 billion unsigned 32-bit integers on a single core of an Intel Core Ultra 9 275HX laptop computer. The units are millions of unsigned 32-bit integers per second.

Optimization	Random	Presorted	Constant
Baseline	154	159	89
Two Phase	177	167	143
Two Phase & Derandomization	203	189	110
Two Phase & Constant Optimization	184	169	292
Two Phase & Derandomization & Constant Optimization	214	200	253

With all optimizations applied, the LSD Radix Sort algorithm saw a speedup of 39% for random data, 26% for presorted data, and 184% (2.84X) for constant data. Performance variation across input data distributions, measured as fastest versus slowest, has been reduced from 79% for the Baseline implementation to 27% for the optimized version. Worst case performance, constant array for the Baseline versus presorted for the optimized, improved by 125% (2.25X).

These implementations are available in this open source repository. Let’s discuss each of these optimizations in more detail.

For reference, C++ std::sort() and std::stable_sort() perform as follows for the same data distributions, in the same units:

Algorithm	Random	Presorted	Constant
std::sort()	13	52	4081
std::stable_sort()	13	74	95

Keep in mind that the LSD Radix Sort is a stable sorting algorithm.

Two Phase

This optimization was developed a few years ago and described in my Faster LSD Radix Sort blog. It splits counting and permutation into separate phases: counting is performed only once, while permutation is performed once per digit. It works from the realization that counting for all digits can be done in a single pass over the array, before any of the permutations. This works because LSD Radix Sort always counts over the entire array, and the counts do not change when array elements are permuted.

Counting is faster than permutation, since during counting the input array is read sequentially from system memory, which is optimal for memory bandwidth.

Derandomization

During the permutation phase of any Radix Sort algorithm (LSD or MSD), each input array element is written to the bin where it belongs. With 8 bits per digit, up to 256 bins are created. These bins are written in data-dependent order: when the data is random, the 256 bins are written in random order. Current system memory architectures perform well for sequential accesses and poorly (as much as 300X slower) for random accesses.

To improve the system memory access pattern during permutation, a small buffer for each bin can be used to collect multiple elements in CPU cache before writing them to system memory. Once a particular bin’s buffer is full, that buffer is copied to system memory sequentially, which is more efficient. The random-access writes go into these buffers in CPU cache, which handles random-access writes well, while the writes to system memory become sequential. An additional benefit is that copies from the cache buffer to bins in system memory are performed using SSE instructions.

Allocating buffers for all bins as a single contiguous array helps these buffers stay in CPU cache and not interfere with each other. This optimization performs twice the memory writes, while being overall more efficient. Since CPU caches use a write-back strategy, twice the number of writes does not mean twice the number of writes to system memory.

Constant Optimization

In a recent blog, I introduced a new optimization which improves performance for constant arrays. The performance issue arises because, for a constant array, the same memory location is incremented over and over again in a tight loop, creating a loop-carried dependency through memory. The fix is to hold the array index in a register instead of in memory. Incrementing a register is significantly faster than incrementing a memory location.

Conclusion

Several performance optimizations were discussed for the single-core LSD Radix Sort. Some of these also apply to other Radix algorithms, such as MSD Radix Sort and Radix Selection. Performance variation across input data distributions was reduced significantly, and the performance of the worst case distribution increased. Some of these ideas should also apply to parallel Radix algorithms.