This blog explores methods to reach the maximum memory read bandwidth. Reading memory is a fundamental operation that limits the performance of many algorithms, whether serial or parallel, and knowing how to reach the maximum available read bandwidth is beneficial in many situations.
One way to test the performance of memory reading is to implement a summation: summing the elements of an array requires reading each element once and produces a single result at the end.
Serial Summation
Simple summation is shown in the following serial C++ implementation, using a for loop:
// left (l) boundary is inclusive and right (r) boundary is exclusive
inline unsigned long long Sum(unsigned long long in_array[], size_t l, size_t r)
{
    unsigned long long sum = 0;
    for (size_t current = l; current < r; current++)
        sum += in_array[current];
    return sum;
}
C++ provides std::accumulate, which is equivalent to the above implementation. However, std::accumulate offers no parallel execution support.
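For reference, the equivalent std::accumulate call looks like the following sketch, where in_array and n are placeholder names for the array and its element count; the initial value 0ULL keeps the accumulation in 64-bit arithmetic:

#include <numeric>    // std::accumulate

// Equivalent of Sum(in_array, 0, n)
unsigned long long sum = std::accumulate(in_array, in_array + n, 0ULL);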
The following Table shows the performance of the above Serial Sum in GigaBytes/second on a laptop computer with an Intel i7-12700H 14-core processor:
Algorithm | Random | Presorted | Constant |
Serial Sum | 14 – 17 | 13 – 17 | 13 – 17 |
Serial std::accumulate | 19 – 21 | 19 – 23 | 22 – 24 |
Performance Variation |
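The bandwidth values throughout this post are the number of bytes read divided by the elapsed time. The sketch below shows one way such a measurement could be taken; it is not the exact benchmark used for the tables here, and the array size and fill value are arbitrary choices:

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<unsigned long long> a(125'000'000, 1);    // ~1 GigaByte of 64-bit values

    auto start = std::chrono::high_resolution_clock::now();
    unsigned long long s = Sum(a.data(), 0, a.size());    // Serial Sum shown above
    auto stop  = std::chrono::high_resolution_clock::now();

    double seconds  = std::chrono::duration<double>(stop - start).count();
    double gbPerSec = double(a.size() * sizeof(a[0])) / seconds / 1.0e9;
    printf("sum = %llu  bandwidth = %.1f GigaBytes/sec\n", s, gbPerSec);
    return 0;
}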
Parallel Summation
Summation parallelizes well: the array can be split into pieces, each piece summed in parallel, and the partial sums then added together to produce a single result. Below is a Parallel Sum implementation:
#include <tbb/parallel_invoke.h>

inline unsigned long long SumParallel(unsigned long long in_array[], size_t l, size_t r, size_t parallelThreshold = 16 * 1024)
{
    if ((r - l) <= parallelThreshold)
        return Sum(in_array, l, r);

    unsigned long long sum_left = 0, sum_right = 0;
    size_t m = r / 2 + l / 2 + (r % 2 + l % 2) / 2;    // average without overflow

    tbb::parallel_invoke(
        [&] { sum_left  = SumParallel(in_array, l, m, parallelThreshold); },
        [&] { sum_right = SumParallel(in_array, m, r, parallelThreshold); }
    );
    // Combine left and right results
    sum_left += sum_right;
    return sum_left;
}
The above code recursively splits the input array into left and right halves until a sub-array becomes small enough to sum with the Serial Sum for loop. The two halves execute in parallel, and their resulting sums (sum_left and sum_right) are combined into a single result. The input array holds 64-bit unsigned integers and produces a 64-bit unsigned sum; arithmetic overflow is not a concern for the purpose of measuring memory bandwidth. For safer Parallel Sum implementations that support all numeric data types without arithmetic overflow, see the HPCsharp C# NuGet package.
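As a usage sketch, assuming ulongs is a std::vector<unsigned long long> and u64Array points at its data (these names mirror the call shown later in this post):

#include <vector>

std::vector<unsigned long long> ulongs(1'000'000'000, 1);    // ~8 GigaBytes; reduce if memory is limited
unsigned long long* u64Array = ulongs.data();

// Sum the whole array; the default parallelThreshold of 16K elements applies
unsigned long long sum = SumParallel(u64Array, 0, ulongs.size());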
The following Table shows performance of C++ Parallel Sum in GigaBytes/second on the 14-core laptop computer:
Parallel Sum | Random | Presorted | Constant |
Parallel Sum (unsigned 64-bit) | 43 – 54 | 48 – 53 | 45 – 54 |
Parallel Performance Variation |
The following Table shows the performance of C++ Parallel Sum in GigaBytes/second on a 48-core computer (a dual-CPU AWS node with 96 logical processors):
Parallel Sum | Random | Presorted | Constant |
Parallel Sum (unsigned 64-bit) | 70 – 76 | 74 – 76 | 73 – 76 |
Parallel Sum (unsigned 8-bit) | 41 – 63 | 64 – 68 | 54 – 68 |
Parallel Performance Variation | 1.1 – 1.5 | 1.0 – 1.1 | 1.0 – 1.3 |
Two variations of the algorithm are compared above: an array of 64-bit unsigned integers and an array of 8-bit unsigned integers, both summed to a 64-bit unsigned result. Reading 64-bit values from memory instead of 8-bit values improves performance at both the low and high ends of the range.
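The 8-bit variant is not shown above; below is a minimal sketch of what it could look like, following the same recursive structure (this is an assumed overload, not necessarily the exact implementation benchmarked):

#include <tbb/parallel_invoke.h>
#include <cstddef>

// Serial base case: 8-bit inputs accumulated into a 64-bit sum
inline unsigned long long Sum(unsigned char in_array[], size_t l, size_t r)
{
    unsigned long long sum = 0;
    for (size_t current = l; current < r; current++)
        sum += in_array[current];
    return sum;
}

inline unsigned long long SumParallel(unsigned char in_array[], size_t l, size_t r, size_t parallelThreshold = 16 * 1024)
{
    if ((r - l) <= parallelThreshold)
        return Sum(in_array, l, r);
    unsigned long long sum_left = 0, sum_right = 0;
    size_t m = l + (r - l) / 2;    // midpoint without overflow
    tbb::parallel_invoke(
        [&] { sum_left  = SumParallel(in_array, l, m, parallelThreshold); },
        [&] { sum_right = SumParallel(in_array, m, r, parallelThreshold); }
    );
    return sum_left + sum_right;
}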
Using a subset of available cores leads to higher performance for some algorithms, as discussed in “Too Many Cooks in The Kitchen”. One way to limit the number of cores being used is to do the following:
sum = ParallelAlgorithms::SumParallel(u64Array, 0, ulongs.size(), ulongs.size() / 24);
The last argument sets parallelThreshold to 1/24-th of the array size. Since the recursion sub-divides the array in half at each level, an array of 1 billion elements is broken into 32 sub-arrays, limiting parallelism to 32 tasks (at most 32 cores). The following Table shows the resulting performance:
Parallel Sum | Random | Presorted | Constant |
Parallel Sum (unsigned 64-bit) | 43 – 114 | 43 – 114 | 43 – 113 |
Parallel Sum (unsigned 8-bit) | 45 – 107 | 45 – 106 | 45 – 107 |
Parallel Performance Variation | 2.7 | 2.7 | 2.7 |
Running the benchmark application once yields the highest performance within the shown range (114 GigaBytes/sec), which is higher than running on all 96 cores. Running the same application again yields the lowest performance within the shown range (45 GigaBytes/sec), which is lower than running on all 96 cores. This pattern continues when executing the benchmark application repeatedly on the 96-core AWS node: performance alternates by 2.7X from one run to the next. This variation in execution performance is puzzling and requires further investigation.
Interestingly, all 48 logical processors within CPU 0 are in use during every execution, while the cores of the second CPU of this dual-CPU machine do no work at all.
Theory: only 32 tasks are created when parallelThreshold is set to 1/24-th of the array size. Each processor has 24 physical cores, each able to run two threads via hyperthreading. During the runs with the highest performance, most of the tasks get mapped to run as one task per core. During the runs with the lowest performance, most of the tasks get mapped to run as two tasks per core. Since both tasks on a core are vying for the same limited resource (system memory), hyperthreading slows performance down.
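The 32-task count in this theory follows from the recursion halving the range until a piece is at or below parallelThreshold. The small hypothetical helper below (not part of the benchmark code) counts those leaf tasks:

#include <cstddef>

// Number of leaf tasks created by halving a range of n elements
// until a piece is <= threshold (mirrors the recursion in SumParallel)
inline size_t LeafTaskCount(size_t n, size_t threshold)
{
    if (n <= threshold)
        return 1;
    size_t m = n / 2;
    return LeafTaskCount(m, threshold) + LeafTaskCount(n - m, threshold);
}

// LeafTaskCount(1'000'000'000, 1'000'000'000 / 24) == 32
// LeafTaskCount(1'000'000'000, 1'000'000'000 / 16) == 16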
Another Data Point
Using 16 cores provides higher performance than using 24 cores, i.e. setting parallelThreshold to 1/16-th of the array size. Performance then oscillates between 71 and 167 GigaBytes/second when doing parallel summation.
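In terms of the call shown earlier, that corresponds to changing only the divisor of the last argument:

sum = ParallelAlgorithms::SumParallel(u64Array, 0, ulongs.size(), ulongs.size() / 16);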
Newer versions of TBB support limiting execution to one task per core.
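For example, oneTBB provides task_arena constraints that can cap the number of threads per core. The sketch below assumes oneTBB 2021 or later built with HWLOC/TBBBind support, without which the constraint may be ignored; it wraps the Parallel Sum call in such an arena:

#include <tbb/task_arena.h>

// Arena that requests at most one worker thread per physical core
tbb::task_arena arena(
    tbb::task_arena::constraints{}.set_max_threads_per_core(1));

unsigned long long sum = 0;
arena.execute([&] {
    sum = SumParallel(u64Array, 0, ulongs.size());
});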