Can C++ Parallel Standard Algorithms Accelerate Even Small Arrays?

My previous blog, C++ Parallel STL Benchmark, showed that performance of all measured C++ Parallel Standard algorithms increased over sequential single-core implementations. Some algorithms scaled much better than others – by nearly 10X on a 14-core processor and over 20X on a 48-core one. Only large arrays of 100 million integers were used for those benchmarks.

Let’s see how fast C++ Parallel algorithms process small, medium and large arrays. The Table below shows parallel and sequential performance of one algorithm, all_of():

| Algorithm | Array Size | seq | Std Dev | par | Std Dev | Parallel Speedup |
|---|---|---|---|---|---|---|
| all_of(std:: | 1,000 | 3707 | 838 | 91 | 35 | 0.025 |
| all_of(dpl:: | 1,000 | 2640 | 756 | 115 | 33 | 0.044 |
| all_of(std:: | 10,000 | 4068 | 374 | 677 | 263 | 0.17 |
| all_of(dpl:: | 10,000 | 3144 | 979 | 592 | 210 | 0.19 |
| all_of(std:: | 100,000 | 4152 | 370 | 3329 | 491 | 0.8 |
| all_of(dpl:: | 100,000 | 3533 | 988 | 6413 | 1598 | 1.8 |
| all_of(std:: | 1,000,000 | 4300 | 323 | 16240 | 1970 | 3.8 |
| all_of(dpl:: | 1,000,000 | 4289 | 353 | 23257 | 3033 | 5.4 |
| all_of(std:: | 10,000,000 | 3290 | 274 | 21266 | 3598 | 6.5 |
| all_of(dpl:: | 10,000,000 | 2847 | 370 | 24502 | 3784 | 8.6 |
| all_of(std:: | 100,000,000 | 3156 | 492 | 13194 | 1049 | 4.2 |
| all_of(dpl:: | 100,000,000 | 4054 | 192 | 13808 | 676 | 3.4 |

The above benchmark was run in Visual Studio 2022, on a laptop with the Windows 11 Power Options setting set to Best Performance. Each algorithm was run 10K times to provide an average run time over many executions, with variation shown by the standard deviation. The std:: implementation is Microsoft’s parallel STL and dpl:: is Intel’s.
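For reference, a minimal sketch of the kind of timing loop behind such numbers is shown below. This is not the exact benchmark harness used for the Table; the array size, predicate, and reporting are illustrative assumptions, with only the 10K repetition count taken from the description above.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <execution>
#include <vector>

int main()
{
    std::vector<int> data(1'000'000, 1);   // array size under test (illustrative)
    constexpr int repeats = 10'000;        // 10K runs, as described above
    int true_count = 0;                    // keep the result so the calls are not optimized away

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < repeats; ++i)
        true_count += std::all_of(std::execution::par, data.begin(), data.end(),
                                  [](int v) { return v == 1; }) ? 1 : 0;
    auto seconds = std::chrono::duration<double>(
        std::chrono::high_resolution_clock::now() - start).count();

    std::printf("average run time: %e seconds (%d true results)\n", seconds / repeats, true_count);
}

The dpl:: measurements would instead use oneapi::dpl::all_of with the oneapi::dpl::execution::par policy from the <oneapi/dpl/algorithm> and <oneapi/dpl/execution> headers.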

Parallel algorithms are as much as 40X slower than the sequential versions for small arrays. For arrays of 10K elements and smaller, the parallel implementations slow performance down. For arrays of 1 million elements and larger, the parallel versions outperform the sequential versions consistently and substantially on the 14-core laptop processor. Between 10K and 1 million elements is a middle region where sometimes the sequential version performs better and at other times the parallel one does.

Let’s explore possible reasons for this lack of performance and a solution suggested in the “Practical Parallel Algorithms in C++ and C#” book.

Using Parallel Algorithms Just Got Complicated

From the algorithm user’s perspective, applying C++ Parallel algorithms adds complexity, since performance is not always higher for the parallel implementation. Users need to determine the minimal array size at which the parallel algorithm can be used; below that size, the sequential implementation needs to be used instead. This threshold would need to be determined for each algorithm, and most likely for each target computer system.
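In practice this means the algorithm user ends up writing a dispatch wrapper along these lines. This is a sketch only; the wrapper name and the 100K default threshold are assumptions and would have to be tuned per algorithm and per machine:

#include <algorithm>
#include <cstddef>
#include <execution>
#include <iterator>

// Hypothetical wrapper: choose sequential or parallel all_of based on a tuned threshold.
template <typename Iterator, typename Predicate>
bool all_of_adaptive(Iterator first, Iterator last, Predicate pred,
                     std::size_t parallel_threshold = 100'000)  // per-algorithm, per-machine tuning (assumption)
{
    const auto n = static_cast<std::size_t>(std::distance(first, last));
    if (n < parallel_threshold)
        return std::all_of(first, last, pred);                   // small arrays: sequential is faster
    return std::all_of(std::execution::par, first, last, pred);  // large arrays: parallel wins
}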

From the parallel algorithm developer’s perspective, it is disappointing that the parallel version only provides a performance benefit for large arrays, which limits and complicates its usage. Not all problems are large. Algorithm developers now need to explain why this happens and caution algorithm users about proper usage.

Ideally, parallel algorithms should “perform no worse” not only for large arrays but also for medium and small arrays – i.e. for any array size. This ideal would simplify the usage of parallel algorithms for all parties.

The Problem

Parallel implementations have substantial overhead from task setup and task switching. This overhead is hidden for large problems, since it is small relative to the amount of work within a large problem. However, for small problems (small arrays) the overhead becomes significant, and can even be larger than the work itself.

By default, the Intel and Microsoft parallel implementations use all of the available cores no matter the problem size. This leads to a “too many cooks in the kitchen” situation, where every core gets a tiny task to work on and they get in each other’s way while trying to get it done.

A kitchen example of a problem that is too small is asking several cooks to slice a single tomato, with each cook taking a turn to make one slice. It is more efficient and faster to have a single cook slice the entire tomato. In other words, some problems are too small for all of the cooks to be involved; for those, a single cook is the fastest option, and involving all of the cooks only slows things down.

A Scalable Solution to Simplify Usage

The measurements in the above Table show:

  • for small problems using a single core is fastest
  • for large problems using all cores is fastest

For medium-size arrays, something in between a single core and all of the cores is needed.

One possible idea, suggested in the “Practical Parallel Algorithms in C++ and C#” book, is to scale the number of cores applied linearly with array size, from small arrays to large arrays. For example, let’s say that arrays with fewer than 10K elements have been determined to be fastest with a single core, and arrays of 10K and larger are faster with two cores. Then for arrays of 10K – 20K elements two cores are applied, for 20K – 30K three cores, for 30K – 40K four cores, and so on, until all cores are used for large arrays.

In other words, for small arrays a single core is used, for large arrays all of the cores are used, and for medium-size arrays more than one but fewer than all of the cores are applied. The number of workers scales with the size of the problem: the bigger the problem, the more workers are applied.
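A sketch of the core-count calculation implied by this rule is shown below; the 10K threshold and the exact rounding are illustrative assumptions:

#include <algorithm>
#include <cstddef>
#include <thread>

// Hypothetical helper: scale the number of cores linearly with problem size.
// Arrays below `threshold` elements use one core; each additional `threshold`
// elements adds another core, capped at the number of hardware cores.
inline unsigned cores_for_size(std::size_t array_size, std::size_t threshold = 10'000)
{
    const unsigned all_cores = std::max(1u, std::thread::hardware_concurrency());
    const unsigned scaled    = static_cast<unsigned>(array_size / threshold) + 1;
    return std::min(all_cores, scaled);
}

With this sketch on a 14-core machine and a 10K threshold, all of the cores would be in use from roughly 130K elements on; larger arrays simply continue to use all of the cores.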

Testing Scalability

The following Table shows performance measurements when the number of cores is scaled linearly with array size:

| Algorithm | Cores | Array Size | par | Std Dev | par +1 core | par +2 cores | par all cores |
|---|---|---|---|---|---|---|---|
| all_of(dpl:: | 1 | 20,000 | 3351 | 998 | 3078 | 3229 | 1803 |
| all_of(dpl:: | 2 | 40,000 | 3918 | 739 | 4336 | 4321 | 3570 |
| all_of(dpl:: | 3 | 60,000 | 5555 | 802 | 5261 | 5384 | 4571 |
| all_of(dpl:: | 4 | 80,000 | 6005 | 922 | 6285 | 6402 | 5684 |
| all_of(dpl:: | 5 | 100,000 | 6572 | 1263 | 7437 | 7342 | 6845 |
| all_of(dpl:: | 6 | 120,000 | 7934 | 1479 | 8147 | 7692 | 7680 |
| all_of(dpl:: | 7 | 140,000 | 8722 | 1492 | 8639 | 8647 | 8524 |
| all_of(dpl:: | 8 | 160,000 | 9246 | 1176 | 9290 | 9552 | 9519 |
| all_of(dpl:: | 9 | 180,000 | 9580 | 1365 | 9965 | 10249 | 9968 |
| all_of(dpl:: | 10 | 200,000 | 10654 | 1300 | 10727 | 10759 | 10886 |
| all_of(dpl:: | 11 | 220,000 | 11349 | 1415 | 11452 | 11515 | 11310 |
| all_of(dpl:: | 12 | 240,000 | 11743 | 1572 | 11996 | 12292 | 11917 |
| all_of(dpl:: | 13 | 260,000 | 12611 | 1487 | 12658 | 12731 | 12567 |
| all_of(dpl:: | 14 | 280,000 | 13112 | 1558 | 13074 | 12443 | 13060 |
| all_of(dpl:: | 15 | 300,000 | 13602 | 1599 | 12979 | 12947 | 13478 |
| all_of(dpl:: | 16 | 320,000 | 13534 | 1677 | 13555 | 13484 | 14016 |
| all_of(dpl:: | 17 | 340,000 | 13998 | 1964 | 13905 | 13952 | 14419 |
| all_of(dpl:: | 18 | 360,000 | 14643 | 2214 | 14242 | 14126 | 14632 |
| all_of(dpl:: | 19 | 380,000 | 14706 | 2349 | 14589 | 14483 | 14991 |
| all_of(dpl:: | 20 | 400,000 | 15254 | 2253 | 15137 | 15006 | 15165 |

  • “par” is the parallel implementation using the number of cores in the “Cores” column
  • “par +1 core” and “par +2 cores” are the parallel implementation using one and two more cores than the “Cores” column, and “par all cores” uses all of the cores

At 20K array size, using two cores is slower than using a single core. As the array size grows, along with the number of cores applied, so does performance. With every step up in array size and core count, performance increases consistently.

Using a few cores results in higher performance for small arrays than using all of the cores. For medium-size arrays, linearly scaling the number of cores results in nearly equivalent performance to using all the cores. Using fewer cores is also more efficient, leaving the rest of the cores available for other purposes.

Implementation

C++ Standard Parallel algorithms do not offer support for such a scalable method.
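One way to approximate the idea today with Intel’s oneDPL, whose dpl::execution::par policy is built on oneTBB by default, is to run the algorithm inside a tbb::task_arena limited to the chosen number of threads. This is a sketch of the concept under that assumption, not the book’s implementation, and it does not apply to Microsoft’s std:: parallel algorithms, which use a different thread pool:

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <oneapi/tbb/task_arena.h>
#include <vector>

// Sketch: run dpl::all_of with at most num_cores worker threads.
bool all_of_limited(const std::vector<int>& data, int num_cores)
{
    oneapi::tbb::task_arena arena(num_cores);   // arena capped at num_cores threads
    return arena.execute([&] {
        return oneapi::dpl::all_of(oneapi::dpl::execution::par,
                                   data.begin(), data.end(),
                                   [](int v) { return v == 1; });
    });
}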

The implementation of the scalable parallel methodology described above is shown in the “Practical Parallel Algorithms in C++ and C#” book, and is available in the https://github.com/DragonSpit/ParallelAlgorithms C++ repository and the https://github.com/DragonSpit/HPCsharp C# repository. The interface looks like:

template< class _Type >
inline void sort_parallel(_Type* src, size_t left, size_t right, _Type* dst, size_t parallel_threshold = 32 * 1024)

where:

  • src is a generic pointer to the source array
  • dst is a pointer to the destination array (not-in-place sort)
  • left and right are array boundaries
  • parallel_threshold, the additional parameter, defines the array size below which a single core is used and above which the number of cores is scaled linearly, as described above

The last parameter has a default value, set conservatively enough to apply across target systems, which a developer can tweak or optimize if desired.
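Usage might look roughly like the following. The header name is a hypothetical placeholder, and whether right is an inclusive or exclusive bound should be checked against the repository; both are assumptions here:

#include <cstddef>
#include <vector>
// #include "SortParallel.h"   // hypothetical header; see the ParallelAlgorithms repository for the actual one

int main()
{
    std::vector<int> src{ 5, 3, 1, 4, 2 };
    std::vector<int> dst(src.size());

    // Sort src into dst (not in place), assuming right is the exclusive end index.
    // The parallel_threshold default of 32K can be overridden to tune for a target system.
    sort_parallel(src.data(), 0, src.size(), dst.data() /*, 64 * 1024 */);
}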

Summary

Parallel algorithms don’t have to perform poorly for small array sizes. They don’t have to perform worse than sequential implementations. Providing scalable parallelism enables them to “perform no worse” than sequential for all problem sizes. By applying the appropriate number of workers (cores) for the size of the problem, parallel algorithms can avoid performance degradation. By not always throwing all the cooks into the kitchen, more efficient compute resource usage is also achieved.

From the parallel algorithm user’s point of view, usefulness expands from only arrays above 1 million elements to all array sizes. Some amount of acceleration is obtained even for small arrays. Only for the smallest arrays is performance merely equivalent to, and not worse than, the sequential implementation.

This reduces the number of considerations that parallel algorithm integrators need to think about and simplifies their lives, enabling them to always apply a scalable parallel algorithm.
