Radix Selection Optimizations

In my previous blog post on Radix Selection, the algorithm which returns the k-th largest value from an unsorted array was shown to be significantly faster than sorting, because it performs less work. In this blog, I’ll explore further optimizations to make Radix Sort even faster.

More Bits Per Digit

Radix Sort (LSD and MSD varieties) commonly use 8-bits per digit. For example, when sorting an array of 32-bit keys, LSD Radix Sort performs 4 passes over the array, processing 8-bits (or one digit) at a time.

Radix Selection can also use digits of the same size, with performance results shown in my previous blog. Digits with more bits can also be used effectively to increase performance of Radix Selection. This is effective because CPU caches have been increasing over the time and are now commonly over 1 MBytes, which is large enough to hold the entire counting array during the counting phase of Radix algorithms. Counting array is randomly accessed for input array that has random data in it. Thus, having the counting array fit entirely in cache boosts performance significantly.

As CPU caches have grown, so can the size of the counting array and thus more bits per digit. For example, using 16-bits/digit when processing an array of 32-bit keys, would take only two passes, while using 8-bits/digit takes four passes. Thus, a potential of performance increasing by 2X exists when using 16-bits/digit versus 8-bits/digit.

This performance optimization works well because the permutation phase of Radix Selection consists of reading sequentially and writing sequentially to a single bin.

Counting While Writing

A new performance optimization has been recently introduced by the “Onesweep Radix Algorithm” which counts while writing. Until this optimization, Radix algorithms consisted of two passes over the array data: counting and permuting. The Onesweep optimization combines writing and counting into a single pass over the array data.

This optimization is effective for Radix Selection because the algorithm writes to a single bin per pass over the array, or a portion of the array as it narrows down on the result. The number of memory operations (reads of data from system memory) is reduced, which is the performance limiter.

Leave a comment