Improving Parallel Performance for Small Arrays

In my previous blogs, Standard C++ Parallel Algorithms were shown to accelerate well on multi-core processors for large arrays, but to degrade performance for small arrays. In this blog, let’s explore a way to increase performance for small arrays.

Data Parallelism

Many Standard C++ Algorithms support several execution modes. The C++ standard defines four: seq, unseq, par, and par_unseq, described in the C++ Parallel STL Benchmark blog. Only the seq and par modes were used in the benchmarks of the previous blogs. The two unseq modes add data parallelism, which some algorithms and compilers (but not all) implement using SIMD/SSE instructions. For instance, the Microsoft compiler does not implement unseq for most of the algorithms, while the Intel compiler does.

unseq implements algorithms using SIMD/SSE data-parallel instructions, which process multiple array elements with a single instruction. For instance, a 512-bit-wide SSE instruction can process sixteen 32-bit integers in a single instruction, whereas a scalar instruction processes one 32-bit integer. A single-core unseq implementation can thus process several data items in parallel, while par_unseq combines multiple cores with data parallelism.
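As a minimal sketch of how these policies appear in code, here is the same all_of() call under two policies. This uses the standard std::execution policies (the benchmarks below use the oneDPL equivalents); std::execution::unseq requires C++20 and a standard library that implements it:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

bool all_positive_seq(const std::vector<int>& v)
{
    // seq: scalar code, one element per instruction, single core.
    return std::all_of(std::execution::seq, v.begin(), v.end(),
                       [](int x) { return x > 0; });
}

bool all_positive_unseq(const std::vector<int>& v)
{
    // unseq: the compiler may vectorize with SIMD/SSE, testing
    // several elements per instruction, still on a single core.
    return std::all_of(std::execution::unseq, v.begin(), v.end(),
                       [](int x) { return x > 0; });
}
```

Both calls return the same result; only the instructions the compiler is allowed to generate differ.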

Benchmark

The following benchmarks compare single-core and multi-core performance of the four implementations, using the Intel compiler on a 14-core Intel i7-12700H processor:

Algorithm     Array Size   seq     Std Dev   unseq   Std Dev   Speedup
all_of(dpl::  1,000        2,077   431       8,310   2,384     4.0
all_of(dpl::  5,000        2,379   594       9,548   1,112     4.0
all_of(dpl::  10,000       2,803   853       9,431   1,389     3.4
all_of(dpl::  20,000       3,412   830       9,294   1,712     2.7
all_of(dpl::  40,000       3,199   884       9,394   1,576     2.9
all_of(dpl::  100,000      3,275   738       9,805   1,464     3.0
all_of(dpl::  200,000      3,839   773       8,378   914       2.2

Single-core performance (above)

Algorithm     Array Size   par      Std Dev   par_unseq   Std Dev   Speedup par_unseq vs unseq
all_of(dpl::  1,000        100      29        103         29        0.01
all_of(dpl::  5,000        485      143       457         130       0.05
all_of(dpl::  10,000       905      276       922         261       0.10
all_of(dpl::  20,000       1,817    490       1,971       430       0.21
all_of(dpl::  40,000       3,010    828       3,352       891       0.36
all_of(dpl::  100,000      6,812    1,332     8,418       1,644     0.86
all_of(dpl::  200,000      10,562   1,871     13,990      2,948     1.67

Multi-core performance (above)

Performance units in the above tables are millions of 32-bit integers per second. The algorithm benchmarked is all_of(), using Intel’s implementation of the Standard C++ Algorithms.

The first table shows that the single-core unseq (SIMD/SSE) implementation always provides acceleration over seq, from 2.2X to 4X.

The second table shows that parallel performance, whether par or par_unseq, is slower than single-core for arrays of 100K elements or smaller. For small arrays, par and par_unseq (SIMD/SSE) performance is up to 100X worse. Multi-core (par) and multi-core SIMD/SSE (par_unseq) perform nearly equally poorly for small arrays; parallel overhead is most likely the overwhelming cause of this performance degradation.

For arrays of 200K elements or larger, both par and par_unseq outperform both single-core implementations (seq and unseq). Multi-core SIMD/SSE (par_unseq) is the overall performance leader for large arrays, although its advantage over par is smaller than unseq’s advantage over seq on a single core.

Perform No Worse

The highest-performing algorithm for small arrays can be combined with the best one for large arrays to achieve superior performance across all array sizes. For small arrays, the single-core unseq (SIMD/SSE) implementation has the highest performance. For large arrays, the multi-core par_unseq (SIMD/SSE) implementation performs best. The following implementation combines these two algorithms:

if (input_array.size() <= 100000)
{
    // Small array: single-core SIMD/SSE (unseq) avoids parallel overhead
    if (all_of(oneapi::dpl::execution::unseq, input_array.begin(), input_array.end(), []( ... ) { ... }))
        ...
}
else
{
    // Large array: multi-core SIMD/SSE (par_unseq) is fastest
    if (all_of(oneapi::dpl::execution::par_unseq, input_array.begin(), input_array.end(), []( ... ) { ... }))
        ...
}

The resulting combination performs many times faster than the single-core seq implementation for all array sizes – small and large. This hybrid approach achieves the “Perform No Worse” principle, where parallelism never degrades performance, making it simpler to reason about and to use in all cases, small or large. It is simply always faster.

For small arrays, data parallelism (unseq) is used on a single core. For large arrays, data parallelism is combined with multiple cores (par_unseq). The resulting combination is several times faster than the non-parallel single-core implementation (seq).
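Wrapped as a reusable helper, the hybrid looks like the sketch below. The name hybrid_all_of is illustrative (not from any library), the 100,000-element crossover comes from the benchmarks above and would need re-measuring on other machines, and the standard std::execution policies stand in for the oneDPL ones:

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <vector>

// Sketch: choose the execution policy from the array size,
// so parallelism is only engaged where it measured faster.
template <class Pred>
bool hybrid_all_of(const std::vector<int>& v, Pred pred,
                   std::size_t crossover = 100'000)
{
    if (v.size() <= crossover)
        // Small array: single-core SIMD avoids parallel overhead.
        return std::all_of(std::execution::unseq,
                           v.begin(), v.end(), pred);
    // Large array: multiple cores plus SIMD.
    return std::all_of(std::execution::par_unseq,
                       v.begin(), v.end(), pred);
}
```

Callers then use a single function for every size, e.g. `hybrid_all_of(data, [](int x) { return x > 0; })`.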

Another Way To Improve Performance

Other ways to improve performance for small arrays are possible. One such method is proposed in the “Practical Parallel Algorithms in C++ and C#: Part 1: Sorting on Multicore CPUs” book. The method ramps up parallelism as the problem size increases. By applying fewer cores to smaller arrays and more cores to larger arrays, the problem of “Too Many Cooks In the Kitchen” is avoided, scaling parallel acceleration well across all array sizes.
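A minimal sketch of that ramp-up idea is shown below. This is not the book’s actual code; the grain size and the thread-count formula are assumptions, and plain std::async stands in for a real thread pool:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Scale the number of worker threads with the array size, so tiny
// arrays stay single-threaded. The 25,000-element grain is an
// assumed tuning value, not a measured one.
std::size_t threads_for(std::size_t n, std::size_t grain = 25'000)
{
    std::size_t hw = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    return std::clamp<std::size_t>(n / grain, 1, hw);
}

bool ramped_all_of(const std::vector<int>& v, bool (*pred)(int))
{
    std::size_t t = threads_for(v.size());
    std::size_t chunk = v.size() / t;
    std::vector<std::future<bool>> parts;
    for (std::size_t i = 0; i < t; ++i)
    {
        auto first = v.begin() + i * chunk;
        auto last  = (i + 1 == t) ? v.end() : first + chunk;
        parts.push_back(std::async(std::launch::async,
            [first, last, pred] { return std::all_of(first, last, pred); }));
    }
    bool ok = true;
    for (auto& f : parts)
        ok = f.get() && ok;   // drain every future, combine results
    return ok;
}
```

With this shape, a 1,000-element array runs on one thread (no parallel overhead), while a 400,000-element array fans out across the available cores.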

This method is not currently supported by the Standard C++ Parallel Algorithms. It is simple to implement and is included in the ParallelAlgorithms repo (C++) and the HPCsharp repo (C#).

A deeper discussion of the method described in this blog is covered in the “Can C++ Parallel Standard Algorithms Accelerate, Even Small Arrays?” blog and the “Parallel Acceleration at Small Scale” blog.
