This blog progressively develops a simple parallel algorithm described in Chapter 6 of a soon to be released book, as a fun parallelism exercise.

The goal of this exercise is to sort a large array of unsigned bytes as quickly as possible, using multiple cores if that provides performance improvement, in-place if possible, targeting the best performance and order.

Introspective Sort or Merge Sort could be used and are ** O**(NlgN). Insertion Sort is

**(N**

*O*^{2}), which is too slow for large arrays. In-Place Merge Sort is

**(Nlg**

*O*^{2}N), which is also too slow. MSD Radix Sort is

**(N), which is the lowest order, and serves as a possible starting idea since it is in-place. A single digit portion of the algorithm could be stripped out since digits are commonly bytes. It performs counting to create bin, followed by permutation which is in-place via swapping of elements. This seems like a lot more complexity than may be necessary for this specific problem.**

*O***Algorithm**

One possible simple method developed in [14, 15, 16] works as follows:

- Read each byte of the array, counting each value (256 possible values) using the Count algorithm
- Write to the original array as many 0’s as were counted, followed by as many 1’s as were counted, and so on, for all the counts, up to the maximum value of 255

For example, the following array of bytes is to be sorted:

1, 1, 3, 2, 1, 3, 3, 2, 1, 2, 1

There are zero 0’s, five 1’s, three 2’s, and three 3’s. Thus, count[0] is 0, count[1] is 5, count[2] is 3, and count[3] is 3. The last step is to write out zero 0’s, five 1’s, three 2’s, and three 3’s back into the original array:

1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3

The result is a sorted array of bytes.

This algorithm is in-place, since only an additional array of 256 counts is necessary to hold all possible counts. The size of the count array stays constant no matter how large the array is being sorted, making it a constant in space usage. The algorithm is ** O**(N) since it reads each array element once during the counting phase and writes each array element once during the writing phase.

Both phases of the algorithm are highly parallelizable since counting the array can be split into sub-arrays with counting proceeding in parallel for each sub-array, followed by combining all the counts into a single count array as is done in Parallel Count. Writing can be parallelized in several ways also.

**Theoretical Performance**

This algorithm reads and writes system memory sequentially. It reads the entire array and then writes the entire array, performing 2N memory accesses, where N is the size of the array. In theory, performance should be limited by system memory bandwidth.

Processor | Bandwidth GigaBytes / sec | Parallel Counting Sort |

6-core i7 9750H | 41.8 | 20 GigaBytes/sec |

24-core 8275CL | 262 | 130 GigaBytes/sec |

In theory, the algorithm should sort an array of bytes at speeds approaching half of system memory bandwidth. In theory…

**Serial Algorithm**

The following C# serial algorithm serves as a good starting point for an in-place implementation:

```
public static int[] Histogram(this byte[] inArray)
{
const int numberOfBins = 256;
int[] counts = new int[numberOfBins];
for (uint currIndex = 0; currIndex < inArray.Length; currIndex++)
counts[inArray[currIndex]]++;
return counts;
}
public static void Fill<T>(this T[] arrayToFill, T value, int startIndex, int length)
{
int endIndex = startIndex + length;
for (int i = startIndex; i < endIndex; i++)
arrayToFill[i] = value;
}
public static void SortCountingInPlace(this byte[] arrayToSort)
{
int[] counts = arrayToSort.Histogram();
int startIndex = 0;
for (uint countIndex = 0; countIndex < counts.Length; countIndex++)
{
arrayToSort.Fill((byte)countIndex, startIndex, counts[countIndex]);
startIndex += counts[countIndex];
}
}
```

The above implementation has two phases: Count (Histogram) and Fill. It only allocates an array for the Count/Histogram results, which is 256 integers. The following Table compares performance to C# Array.Sort running on 6-core i7 9750H processor in millions of bytes per second:

Algorithm | Random | Presorted | Constant |

Array.Sort | 19 | 19 | 19 |

SortCounting | 671 | 672 | 672 |

C++ in-place algorithm implementation is shown below:

```
inline void counting_sort(unsigned char* array_to_sort, size_t l, size_t r)
{
size_t counts[256]{};
for (size_t _current = l; _current < r; _current++)
counts[array_to_sort[_current]]++;
size_t start_index = 0;
for (size_t count_index = 0; count_index < 256; count_index++)
{
size_t end_index = start_index + counts[count_index];
for (size_t i = start_index; i <= end_index; i++)
array_to_sort[i] = count_index;
start_index += counts[count_index];
}
}
```

Table below shows it’s performance:

Algorithm | Random | Presorted | Constant |

std::sort | 32 | 592 | 1686 |

counting_sort | 1317 | 1340 | 596 |

counting_sort w/ std::fill | 1942 | 1996 | 700 |

C++ implements two variations for the writing of results phase: for loop, and std::fill. The version using std::fill runs significantly faster, most likely because std::fill is implemented using SIMD/SSE instructions. Substantially lower performance, of nearly 3X, for constant arrays is puzzling.

Performance measurements of the Counting phase and Write phase of the algorithm would be useful, to see where the algorithm is spending majority of its time. Table below shows this breakdown for the C++ implementation, in MegaBytes/second:

Phase of C++ Implementation | Random | Presorted | Constant |

Counting Phase | 2100 | 2096 | 711 |

Write Phase w/ for loop | 3466 | 3378 | 3378 |

Write Phase w/ std::fill | 19357 | 17762 | 18762 |

These measurements show that most of the time is spent in Counting. When std::fill is utilized in the Write phase, half of the theoretical available bandwidth on the 6-core i7 9750H computer is achieved. This method (std::fill) is over 5X faster than using a for loop.

**Theory**: Counting of a constant array is slower because it increments the same location within the count array for each byte that is read in from the source array. This is a dependency in the Counting for loop. Random and Presorted distributions increment different locations within the count array for each source array element, and don’t have this type of a dependency.

**Possible Workaround**: Use several count arrays (e.g., 2 count arrays) and alternate which of the multiple count arrays is used. For instance, with 2 count arrays, for all array elements with an even index the first count array is used. For all elements with an odd index the second count array is used. After counting has completed the two count arrays are combined to return a single count array.

Let’s explore parallelism before addressing this constant array performance problem, as parallelism has the potential to provide a significant performance boost.

## Parallel Algorithm

Since Counting is the performance limiter, implementing a parallel version should provide the most performance gain. The following C++ implementation breaks the problem down into left and right half recursively, running each half in parallel:

```
template< unsigned long PowerOfTwoRadix >
inline size_t* HistogramOneByteComponentParallel(unsigned char inArray[], size_t l, size_t r, size_t parallelThreshold = 16 * 1024)
{
const unsigned long numberOfBins = PowerOfTwoRadix;
size_t* countLeft = NULL;
size_t* countRight = NULL;
if (l >= r) // zero elements to compare
{
countLeft = new size_t[numberOfBins]{};
return countLeft;
}
if ((r - l) <= parallelThreshold)
{
countLeft = new size_t[numberOfBins]{};
for (size_t current = l; current < r; current++) // Count
countLeft[inArray[current]]++;
return countLeft;
}
size_t m = r / 2 + l / 2 + (r % 2 + l % 2) / 2; // average without overflow
Concurrency::parallel_invoke(
//tbb::parallel_invoke(
[&] { countLeft = HistogramOneByteComponentParallel <PowerOfTwoRadix>(inArray, l, m, parallelThreshold); },
[&] { countRight = HistogramOneByteComponentParallel <PowerOfTwoRadix>(inArray, m, r, parallelThreshold); }
);
// Combine left and right results
for (size_t j = 0; j < numberOfBins; j++)
countLeft[j] += countRight[j];
delete[] countRight;
return countLeft;
}
```

Parallelism is achieved by using TBB or PPL parallel_invoke() to run multiple tasks in parallel. Two tasks are run in parallel: counting of the left half of the array and counting of the right half of the array. The resulting count arrays, countLeft and countRight, are combined into the countLeft array, which is returned as the result.

Only the leaf nodes of recursion perform the actual counting work. Allocation of new counting arrays is also done only in the leaf nodes of recursion. The rest of the recursion tree combines left and right counting arrays into a single counting array. At the top of the recursion tree a single counting array is finally returned.

Parallel Counting performance by itself is shown in the following Table with units in MegaBytes/second running on 6-core i7 9750H processor:

Phase of C++ Implementation | Random | Presorted | Constant |

Parallel Counting Phase | 7876 | 8683 | 5076 |

Parallel Performance Increase | 3.75 | 4.14 | 7.14 |

Counting phase by itself is scaling from 3.75X to 7X on a 6-core processor. Array with constant values is still lagging in performance.

Table below shows a benchmark for the entire Parallel Sort C++ implementation, with units of MegaBytes/second:

Algorithm | Random | Presorted | Constant |

Parallel Sort w/ Parallel Counting Phase and std::fill | 7249 | 7304 | 4386 |

Parallel std::sort | 98 | 206 | 2141 |

Performance of this parallel algorithm is 74X faster than C++ Parallel std::sort when sorting a random array of bytes on a 6-core CPU. It is 3.7X faster than the Serial Counting Sort, when parallelizing only the Counting phase of the algorithm.

### Parallel Writing

The first part of parallelizing the Write phase is to develop a Parallel Fill. C++ std::fill provides the interface to support a parallel version, but some current compilers do not support parallelization of it. The following is an implementation of a Parallel Fill:

```
inline void parallel_fill(unsigned char* src, unsigned char value, size_t l, size_t r, size_t parallel_threshold = 16 * 1024)
{
if (r <= l)
return;
if ((r - l) < parallel_threshold)
{
std::fill(src + l, src + r, value);
return;
}
size_t m = r / 2 + l / 2 + (r % 2 + l % 2) / 2; // average without overflow
Concurrency::parallel_invoke(
//tbb::parallel_invoke(
[&] { parallel_fill(src, value, l, m, parallel_threshold); },
[&] { parallel_fill(src, value, m, r, parallel_threshold); }
);
}
```

Performance of the Parallel Fill in MegaBytes/second is shown in the following Table:

Phase of C++ Implementation | Random | Presorted | Constant |

Parallel Fill Phase (2-cores) | 23,117 | 23,705 | 23,646 |

Parallel Fill Phase (6-cores) | 24,022 | 23,422 | 21,628 |

Parallel Performance Increase | 1.19 | 1.33 | 1.26 |

Parallel Fill performance increases by 19-33% by using 2-cores over the Serial version and does not increase significantly when adding more cores on a 6-core processor.

### Improving Counting

Sorting an array filled with constant values lags in performance for Serial and Parallel implementations of Counting Sort. To improve the Counting phase of the algorithm, multiple counting arrays can be employed as discussed in the Theory above, shown in the following code snippet:

```
countLeft_0 = new size_t[numberOfBins]{};
countLeft_1 = new size_t[numberOfBins]{};
countLeft_2 = new size_t[numberOfBins]{};
countLeft_3 = new size_t[numberOfBins]{};
size_t last_by_four = l + ((r - l) / 4) * 4;
size_t current = l;
for (; current < last_by_four;)
{
countLeft_0[inArray[current]]++; current++;
countLeft_1[inArray[current]]++; current++;
countLeft_2[inArray[current]]++; current++;
countLeft_3[inArray[current]]++; current++;
}
for (; current < r; current++) // possibly last element
countLeft_0[inArray[current]]++;
```

the above code is the core of the overall improved Counting implementation, where four counting arrays are used to remove the dependency within the counting loop. Instead of incrementing the same memory location within the count array, when sorting an array filled with a constant value, the same index is incremented in four separate arrays per iteration of the for loop.

After counting the entire array, one element at a time, the count arrays are combined into a single count array, which is then returned as shown below:

```
for (size_t count_index = 0; count_index < numberOfBins; count_index++)
{
countLeft_0[count_index] += countLeft_1[count_index];
countLeft_0[count_index] += countLeft_2[count_index];
countLeft_0[count_index] += countLeft_3[count_index];
}
```

the above code uses four counting arrays to remove the dependency within the counting loop. Instead of incrementing the same memory location within a single count array, when sorting an array of constants, the same index is incremented in four separate arrays per iteration of the for loop, breaking the dependency across loop iterations.

After counting of the array has been finished, the count arrays are combined into a single count array, which is then returned as shown below:

```
for (size_t count_index = 0; count_index < numberOfBins; count_index++)
{
countLeft_0[count_index] += countLeft_1[count_index];
countLeft_0[count_index] += countLeft_2[count_index];
countLeft_0[count_index] += countLeft_3[count_index];
}
```

This modification brings performance of the Counting phase for constant arrays to the same level as Counting of random arrays.

## Performance on 96-Core Processor

Let’s see how the algorithm, with Parallel Count and Parallel Fill, performs on a 96-core AWS c5.24xlarge node, which consists of dual-CPUs with 24 hyperthreaded cores each. Below is the Parallel Counting Sort implementation in C++:

```
template< unsigned long NumberOfBins >
inline void counting_sort_parallel_inner(unsigned char *array_to_sort, size_t l, size_t r, size_t threshold_count = 64 * 1024, size_t threshold_fill = 64 * 1024)
{
size_t* counts = HistogramOneByteComponentParallel_3< NumberOfBins >(array_to_sort, l, r, threshold_count);
size_t start_indexes[NumberOfBins];
start_indexes[0] = 0;
for (size_t count_index = 1; count_index < NumberOfBins; count_index++)
start_indexes[count_index] = start_indexes[count_index - 1] + counts[count_index - 1];
Concurrency::parallel_for(size_t(0), size_t(NumberOfBins), [&](size_t count_index) {
parallel_fill(array_to_sort, (unsigned char)count_index, start_indexes[count_index], start_indexes[count_index] + counts[count_index], threshold_fill);
});
delete[] counts;
}
```

The above implementation uses Parallel Count (Histogram), which uses multiple count arrays to improve performance for constant arrays. It also uses parallel_fill, constructed with std::fill, which uses SIMD/SSE instructions for higher performance memory writes. All the bins can be filled in parallel, which is done using parallel_for capability of TBB and PPL. The parallel_fill is called in parallel for each bin, providing additional parallelism.

Performance on a 96-core AWS node for each phase of the Parallel Counting Sort is show below, in MegaBytes/second:

Phase of C++ Implementation | Random | Presorted | Constant |

Serial Counting Phase | 2028 | 2081 | 669 |

Parallel Counting Phase (96 cores) | 39534 – 60608 | 56308 – 60257 | 62286 – 65304 |

Parallel Performance Increase | 19X – 30X | 27X – 30X | 93X – 98X |

Parallel Counting scales by at least 19X over Serial Counting on the 96-core machine, and by 4X over running on the 6-core machine. Performance of the Parallel Counting phase is exhibiting more variation from run to run. This needs investigation into the root cause of variation, to see if it can be improved.

Phase of C++ Implementation | Random | Presorted | Constant |

Serial Fill Phase | 7231 | 7234 | 7228 |

Parallel Fill Phase (96 cores) | 51230 | 50862 | 45308 |

Parallel Performance Increase | 7X | 7X | 6.3X |

Parallel Fill scales by over 6X over Serial Fill, and by 2X over running on 6-core machine.

Algorithm | Random | Presorted | Constant |

Serial std::sort | 30 | 56 | 1565 |

Parallel std::sort | 95 | 300 | 1994 |

Parallel Performance Increase | 3.2 | 5.4 | 1.3 |

Algorithm | Random | Presorted | Constant |

Serial Counting Sort | 1584 | 1618 | 612 |

Parallel Counting Sort (96 cores) | 22645 – 28205 | 26018 – 27912 | 25063 – 27729 |

Parallel Performance Increase | 14X – 18X | 16X – 17X | 41X – 45X |

The overall Parallel Counting Sort scales by at least 14X over Serial Counting Sort on the 96-core machine, and by 3X over running on 6-core machine. C++ std::sort scales much less for 1 Billion byte array sorting. Parallel Counting Sort is over 238X faster than Parallel std::sort for random byte arrays, which is the slowest case. Parallel Counting Sort performance is more consistent across data distributions.

Performance achieved so far peaks at 28 GigaBytes/second out of the theoretical maximum of 130 GigaBytes/second, leaving over 4X performance improvement opportunity. To increase performance further, memory read and memory write performance need to be closer to their theoretical limits.

## Sum on 96-Core Processor

The first phase of Parallel Counting Sort is spent reading and counting the bytes of the array. Speeding up this operation further would increase performance. One way to test performance of memory reading is to implement a Parallel Summation. Summing elements of an array requires reading each element once, producing a single result at the end, as shown in the following serial C++ implementation:

```
// left (l) boundary is inclusive and right (r) boundary is exclusive
inline unsigned long long Sum(unsigned long long in_array[], size_t l, size_t r)
{
unsigned long long sum = 0;
for (size_t current = l; current < r; current++)
sum += in_array[current];
return sum;
}
```

C++ provides std::accumulate, which is equivalent to the above implementation. However, no parallel support is offered.

Summation parallelizes well, where an array can be split into pieces, with each summed in parallel, followed by adding sums of each piece together to produce a single sum result. Below is a Parallel Sum implementation:

```
inline unsigned long long SumParallel(unsigned long long in_array[], size_t l, size_t r, size_t parallelThreshold = 16 * 1024)
{
if ((r - l) <= parallelThreshold)
return Sum( in_array, l, r );
unsigned long long sum_left = 0, sum_right = 0;
size_t m = r / 2 + l / 2 + (r % 2 + l % 2) / 2; // average without overflow
tbb::parallel_invoke(
[&] { sum_left = SumParallel(in_array, l, m, parallelThreshold); },
[&] { sum_right = SumParallel(in_array, m, r, parallelThreshold); }
);
// Combine left and right results
sum_left += sum_right;
return sum_left;
}
```

The above code splits the input array into left and right half recursively, until the sub-array gets small enough to sum all the elements using a Serial Sum for loop. The two halves are executed in parallel. Their resulting sums (sum_left and sum_right) are combined to create a single sum_left result. The input array is of 64-bit unsigned integers, producing a 64-bit unsigned sum. Arithmetic overflow is not a concern for the purpose of measuring memory bandwidth. For safer Parallel Sum implementations supporting all numeric data types without arithmetic overflow see HPCsharp C# nuget package.

The following Table shows performance of C++ Parallel Sum in GigaBytes/second on a 96-core computer:

Parallel Sum | Random | Presorted | Constant |

Parallel Sum (unsigned 64-bit) | 70 – 76 | 74 – 76 | 73 – 76 |

Parallel Sum (unsigned 8-bit) | 41 – 63 | 64 – 68 | 54 – 68 |

Parallel Performance Variation | 1.1 – 1.5 | 1.0 – 1.1 | 1.0 – 1.3 |

Two variations of algorithm are compared above: array of 64-bit unsigned integers and array of 8-bit unsigned integers, summed to a 64-bit unsigned result in both cases. Reading 64-bit values from memory versus reading 8-bit values, improves performance on the low and high ends of the range.

Using a subset of available cores leads to higher performance for some algorithms, as discussed in “Too Many Cooks in The Kitchen”. One way to limit the number of cores being used is to do the following:

`sum = ParallelAlgorithms::SumParallel(u64Array, 0, ulongs.size(), ulongs.size() / 24);`

the last argument sets the parallelThreshold to be 24-th of the array size. This breaks the array into 32 sub-arrays, for an array of 1 billion elements sub-dividing in half during recursion, limiting parallelism to 32 cores. The following Table shows resulting performance:

Parallel Sum | Random | Presorted | Constant |

Parallel Sum (unsigned 64-bit) | 43 – 114 | 43 – 114 | 43 – 113 |

Parallel Sum (unsigned 8-bit) | 45 – 107 | 45 – 106 | 45 – 107 |

Parallel Performance Variation | 2.7 | 2.7 | 2.7 |

Running the benchmark application one time leads to the highest performance within the shown range (114 GigaBytes/sec), which is higher than running on all 96 cores. Running the same application again leads to the lowest performance within the shown range (45 GigaBytes/sec), which is lower than running on all 96 cores. This pattern continues when executing the benchmark application again and again on the 96-core AWS node. In other words, performance alternates by 2.7X every other time of executing this application. This variation in execution performance is puzzling, requiring further investigation.

Interestingly, all 48-cores within CPU 0 are being used during all the executions. Also, all cores within CPU 2 of this dual-CPU machine are not executing any work.

**Theory**: Only 32-tasks are created when parallelThreshold is set to 1/24-th of the array size. Each processor has 24 physical cores, each able to run two threads via hyperthreading. During the run of the benchmarking application with the highest performance, most of the tasks get mapped to run as one task per core. During the run with the lowest performance, most of the tasks get mapped to run as two tasks per core. Since both tasks on a core a vying for the same limited resource (system memory), hyperthreading is slowing performance down.

**Another Data Point**

Using 16 cores provides higher performance than using 24 cores – i.e. setting parallelThreshold to be 1/16-th of the array size. Performance then oscillates between 71 and 167 Gigabytes/second when doing parallel summation.

Newer version of TBB supports limiting one task per core.

#### Summary

For sorting an array of bytes, an In-Place Parallel Sorting algorithm has been developed, which performs at least 238X faster than Parallel std::sort on 96-core computer, and at least 74X on 6-core computer. Performance order is ** O**(N), since it reads each array element once and writes each array element once, versus

**(NlgN) for std::sort. Performing one read and one write operation to the array being sorted is minimal amount of memory operations necessary to permute (sort) an array.**

*O*Further investigation into increasing memory read performance was done by implementing a Parallel Sum algorithm, which showed that only a single processor on a two-processor system was being used when reaching peak read memory bandwidth. This shows further opportunities in performance improvements, and application of lessons learned to other parallel algorithms.

**Further Development**

Performance optimization is never ending story. Further optimizations are being developed in this blog entry. Come and help get closer to the theoretical maximum.

### Further Ideas

Read memory bandwidth on a dual-CPU 96-core computer system seems to peak when 32 tasks (threads) are used. This reaches about half of the available read bandwidth. However, these 32 tasks are currently running only on CPU 0, with CPU 1 being idle. Potentially higher read bandwidth could be reached if the tasks were to be balanced across both CPUs.

Hyperthreading is a wonderful concept when parallel tasks use different shared resources. However, when parallel tasks are after the same shared resource, such as system memory, hyperthreading slows performance down. Potentially higher performance could be reached by turning hyperthreading off. TBB provides a way to limit each core to running only a single task/thread.

Writing performance could potentially be improved by using SIMD/SSE instructions that go around the cache – i.e., don’t evict items from cache. This could be beneficial when arrays larger than cache size.

### Source Code Repositories

For C#: https://github.com/DragonSpit/HPCsharp

For C++: https://github.com/DragonSpit/ParallelAlgorithms

C++ repository has working code for all code samples shown in this blog.