Can C++ Parallel Standard Algorithms Accelerate Even Small Arrays?

My previous blog, C++ Parallel STL Benchmark, showed that performance of all measured C++ Parallel Standard algorithms increased over sequential single-core implementations. Some algorithms scaled much better than others – by nearly 10X on a 14-core processor and over 20X on a 48-core one. Only large arrays of 100 million integers were used for those benchmarks.

Let’s see how fast C++ Parallel algorithms process small, medium and large arrays. The Table below shows parallel and sequential performance of one algorithm, all_of():

| Algorithm | Array Size | seq | Std Dev | par | Std Dev | Parallel Speedup |
|---|---|---|---|---|---|---|
| all_of(std:: | 1,000 | 3707 | 838 | 91 | 35 | 0.025 |
| all_of(dpl:: | 1,000 | 2640 | 756 | 115 | 33 | 0.044 |
| all_of(std:: | 10,000 | 4068 | 374 | 677 | 263 | 0.17 |
| all_of(dpl:: | 10,000 | 3144 | 979 | 592 | 210 | 0.19 |
| all_of(std:: | 100,000 | 4152 | 370 | 3329 | 491 | 0.8 |
| all_of(dpl:: | 100,000 | 3533 | 988 | 6413 | 1598 | 1.8 |
| all_of(std:: | 1,000,000 | 4300 | 323 | 16240 | 1970 | 3.8 |
| all_of(dpl:: | 1,000,000 | 4289 | 353 | 23257 | 3033 | 5.4 |
| all_of(std:: | 10,000,000 | 3290 | 274 | 21266 | 3598 | 6.5 |
| all_of(dpl:: | 10,000,000 | 2847 | 370 | 24502 | 3784 | 8.6 |
| all_of(std:: | 100,000,000 | 3156 | 492 | 13194 | 1049 | 4.2 |
| all_of(dpl:: | 100,000,000 | 4054 | 192 | 13808 | 676 | 3.4 |

The above benchmark was run in Visual Studio 2022, on a laptop with the Windows 11 Power Options setting set to Best Performance. Each algorithm was run 10K times to provide an average run time over many executions, with variation shown by the standard deviation. The std:: implementation is Microsoft’s parallel STL and dpl:: is Intel’s.
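For reference, a minimal sketch of the kind of timing loop behind such numbers is shown below. This is not the exact benchmark harness used for the Table; the array size, predicate, and reporting are illustrative assumptions, with only the 10K repetition count taken from the description above.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <execution>
#include <vector>

int main()
{
    std::vector<int> data(1'000'000, 1);   // array size under test (illustrative)
    constexpr int repeats = 10'000;        // 10K runs, as described above
    int true_count = 0;                    // keep the result so the calls are not optimized away

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < repeats; ++i)
        true_count += std::all_of(std::execution::par, data.begin(), data.end(),
                                  [](int v) { return v == 1; }) ? 1 : 0;
    auto seconds = std::chrono::duration<double>(
        std::chrono::high_resolution_clock::now() - start).count();

    std::printf("average run time: %e seconds (%d true results)\n", seconds / repeats, true_count);
}

The dpl:: measurements would instead use oneapi::dpl::all_of with the oneapi::dpl::execution::par policy from the <oneapi/dpl/algorithm> and <oneapi/dpl/execution> headers.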

Parallel algorithms are as much as 40X slower than the sequential versions for small arrays. For arrays of 10K elements and smaller, the parallel implementations slow performance down. For arrays of 1 million elements and larger, the parallel versions outperform the sequential versions consistently and substantially on the 14-core laptop processor. Between 10K and 1 million elements is a middle region where sometimes the sequential version performs better and at other times the parallel one does.

Let’s explore possible reasons for this lack of performance and a solution suggested in the “Practical Parallel Algorithms in C++ and C#” book.

Using Parallel Algorithms Just Got Complicated

From the algorithm user’s perspective, applying C++ Parallel algorithms adds complexity, since performance is not always higher for the parallel implementation. Users need to determine the minimal array size at which the parallel algorithm can be used; below that size, the sequential implementation needs to be used instead. This threshold would need to be determined for each algorithm, and most likely for each target computer system.
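In practice this means the algorithm user ends up writing a dispatch wrapper along these lines. This is a sketch only; the wrapper name and the 100K default threshold are assumptions and would have to be tuned per algorithm and per machine:

#include <algorithm>
#include <cstddef>
#include <execution>
#include <iterator>

// Hypothetical wrapper: choose sequential or parallel all_of based on a tuned threshold.
template <typename Iterator, typename Predicate>
bool all_of_adaptive(Iterator first, Iterator last, Predicate pred,
                     std::size_t parallel_threshold = 100'000)  // per-algorithm, per-machine tuning (assumption)
{
    const auto n = static_cast<std::size_t>(std::distance(first, last));
    if (n < parallel_threshold)
        return std::all_of(first, last, pred);                   // small arrays: sequential is faster
    return std::all_of(std::execution::par, first, last, pred);  // large arrays: parallel wins
}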

From the parallel algorithm developer’s perspective, it is disappointing that the parallel version only provides a performance benefit for large arrays, which limits and complicates its usage. Not all problems are large. Algorithm developers now need to explain why this happens and caution algorithm users about proper usage.

Ideally, parallel algorithms should “perform no worse” not only for large arrays but also for medium and small arrays – i.e. for any array size. This ideal would simplify the usage of parallel algorithms for all parties.

The Problem

Parallel implementations have substantial overhead from task setup and task switching. This overhead is hidden for large problems, since it is small relative to the amount of work within a large problem. However, for small problems (small arrays) the overhead becomes significant, and can even be larger than the work itself.

By default, the Intel and Microsoft parallel implementations use all of the available cores no matter the problem size. This leads to a “too many cooks in the kitchen” situation, where every core gets a tiny task to work on and they get in each other’s way while trying to get it done.

A kitchen example of a problem that is too small is asking several cooks to slice a single tomato, with each cook taking a turn to make one slice. It is more efficient and faster to have a single cook slice the entire tomato. In other words, some problems are too small for all of the cooks to be involved; for those, a single cook is the fastest option, and involving all of the cooks only slows things down.

A Scalable Solution to Simplify Usage

The measurements in the above Table show:

  • for small problems using a single core is fastest
  • for large problems using all cores is fastest

For medium-size arrays, something in between a single core and all of the cores is needed.

One possible idea, suggested in the “Practical Parallel Algorithms in C++ and C#” book, is to scale the number of cores applied linearly with array size, from small arrays to large arrays. For example, let’s say that arrays with fewer than 10K elements have been determined to be fastest with a single core, and arrays of 10K and larger are faster with two cores. Then for arrays of 10K – 20K elements two cores are applied, for 20K – 30K three cores, for 30K – 40K four cores, and so on, until all cores are used for large arrays.

In other words, for small arrays a single core is used, for large arrays all of the cores are used, and for medium-size arrays more than one but fewer than all of the cores are applied. The number of workers scales with the size of the problem: the bigger the problem, the more workers are applied.
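A sketch of the core-count calculation implied by this rule is shown below; the 10K threshold and the exact rounding are illustrative assumptions:

#include <algorithm>
#include <cstddef>
#include <thread>

// Hypothetical helper: scale the number of cores linearly with problem size.
// Arrays below `threshold` elements use one core; each additional `threshold`
// elements adds another core, capped at the number of hardware cores.
inline unsigned cores_for_size(std::size_t array_size, std::size_t threshold = 10'000)
{
    const unsigned all_cores = std::max(1u, std::thread::hardware_concurrency());
    const unsigned scaled    = static_cast<unsigned>(array_size / threshold) + 1;
    return std::min(all_cores, scaled);
}

With this sketch on a 14-core machine and a 10K threshold, all of the cores would be in use from roughly 130K elements on; larger arrays simply continue to use all of the cores.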

Testing Scalability

The following Table shows performance measurements when the number of cores is scaled linearly with array size:

| Algorithm | Cores | Array Size | par | Std Dev | par +1 core | par +2 cores | par all cores |
|---|---|---|---|---|---|---|---|
| all_of(dpl:: | 1 | 20,000 | 3351 | 998 | 3078 | 3229 | 1803 |
| all_of(dpl:: | 2 | 40,000 | 3918 | 739 | 4336 | 4321 | 3570 |
| all_of(dpl:: | 3 | 60,000 | 5555 | 802 | 5261 | 5384 | 4571 |
| all_of(dpl:: | 4 | 80,000 | 6005 | 922 | 6285 | 6402 | 5684 |
| all_of(dpl:: | 5 | 100,000 | 6572 | 1263 | 7437 | 7342 | 6845 |
| all_of(dpl:: | 6 | 120,000 | 7934 | 1479 | 8147 | 7692 | 7680 |
| all_of(dpl:: | 7 | 140,000 | 8722 | 1492 | 8639 | 8647 | 8524 |
| all_of(dpl:: | 8 | 160,000 | 9246 | 1176 | 9290 | 9552 | 9519 |
| all_of(dpl:: | 9 | 180,000 | 9580 | 1365 | 9965 | 10249 | 9968 |
| all_of(dpl:: | 10 | 200,000 | 10654 | 1300 | 10727 | 10759 | 10886 |
| all_of(dpl:: | 11 | 220,000 | 11349 | 1415 | 11452 | 11515 | 11310 |
| all_of(dpl:: | 12 | 240,000 | 11743 | 1572 | 11996 | 12292 | 11917 |
| all_of(dpl:: | 13 | 260,000 | 12611 | 1487 | 12658 | 12731 | 12567 |
| all_of(dpl:: | 14 | 280,000 | 13112 | 1558 | 13074 | 12443 | 13060 |
| all_of(dpl:: | 15 | 300,000 | 13602 | 1599 | 12979 | 12947 | 13478 |
| all_of(dpl:: | 16 | 320,000 | 13534 | 1677 | 13555 | 13484 | 14016 |
| all_of(dpl:: | 17 | 340,000 | 13998 | 1964 | 13905 | 13952 | 14419 |
| all_of(dpl:: | 18 | 360,000 | 14643 | 2214 | 14242 | 14126 | 14632 |
| all_of(dpl:: | 19 | 380,000 | 14706 | 2349 | 14589 | 14483 | 14991 |
| all_of(dpl:: | 20 | 400,000 | 15254 | 2253 | 15137 | 15006 | 15165 |

  • “par” is the parallel implementation using the number of cores in the “Cores” column
  • “par +1 core” and “par +2 cores” are the parallel implementation using one and two more cores than the “Cores” column, and “par all cores” uses all of the cores

At 20K array size, using two cores is slower than using a single core. As the array size grows, along with the number of cores applied, so does performance. With every step up in array size and core count, performance increases consistently.

Using a few cores results in higher performance for small arrays than using all of the cores. For medium-size arrays, linearly scaling the number of cores results in nearly equivalent performance to using all the cores. Using fewer cores is also more efficient, leaving the rest of the cores available for other purposes.

Implementation

C++ Standard Parallel algorithms do not offer support for such a scalable method.
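One way to approximate the idea today with Intel’s oneDPL, whose dpl::execution::par policy is built on oneTBB by default, is to run the algorithm inside a tbb::task_arena limited to the chosen number of threads. This is a sketch of the concept under that assumption, not the book’s implementation, and it does not apply to Microsoft’s std:: parallel algorithms, which use a different thread pool:

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <oneapi/tbb/task_arena.h>
#include <vector>

// Sketch: run dpl::all_of with at most num_cores worker threads.
bool all_of_limited(const std::vector<int>& data, int num_cores)
{
    oneapi::tbb::task_arena arena(num_cores);   // arena capped at num_cores threads
    return arena.execute([&] {
        return oneapi::dpl::all_of(oneapi::dpl::execution::par,
                                   data.begin(), data.end(),
                                   [](int v) { return v == 1; });
    });
}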

The implementation of the scalable parallel methodology described above is shown in the “Practical Parallel Algorithms in C++ and C#” book, and is available in the https://github.com/DragonSpit/ParallelAlgorithms C++ repository and the https://github.com/DragonSpit/HPCsharp C# repository. The interface looks like:

template< class _Type >
inline void sort_parallel(_Type* src, size_t left, size_t right, _Type* dst, size_t parallel_threshold = 32 * 1024)

where:

  • src is a generic pointer to the source array
  • dst is a pointer to the destination array (not-in-place sort)
  • left and right are array boundaries
  • parallel_threshold, the additional parameter, defines the array size below which a single core is used and above which the number of cores is scaled linearly, as described above

The last parameter has a default value, set conservatively enough to apply across target systems, which a developer can tweak or optimize if desired.
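Usage might look roughly like the following. The header name is a hypothetical placeholder, and whether right is an inclusive or exclusive bound should be checked against the repository; both are assumptions here:

#include <cstddef>
#include <vector>
// #include "SortParallel.h"   // hypothetical header; see the ParallelAlgorithms repository for the actual one

int main()
{
    std::vector<int> src{ 5, 3, 1, 4, 2 };
    std::vector<int> dst(src.size());

    // Sort src into dst (not in place), assuming right is the exclusive end index.
    // The parallel_threshold default of 32K can be overridden to tune for a target system.
    sort_parallel(src.data(), 0, src.size(), dst.data() /*, 64 * 1024 */);
}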

Summary

Parallel algorithms don’t have to perform poorly for small array sizes. They don’t have to perform worse than sequential implementations. Providing scalable parallelism enables them to “perform no worse” than sequential for all problem sizes. By applying the appropriate number of workers (cores) for the size of the problem, parallel algorithms can avoid performance degradation. By not always throwing all the cooks into the kitchen, more efficient compute resource usage is also achieved.

From the parallel algorithm user’s point of view, usefulness expands from only arrays above 1 million elements to all array sizes. Some amount of acceleration is obtained even for small arrays. Only for the smallest arrays is performance merely equivalent to, and not worse than, the sequential implementation.

This reduces the number of considerations that parallel algorithm integrators need to think about and simplifies their lives, enabling them to always apply a scalable parallel algorithm.
