Parallel Pattern of Bundling Small Work Items

Parallel programming frameworks like C++ Threading Building Blocks (TBB) and the C# Task Parallel Library (TPL) are good at handling large work items on multi-core processors. Both provide mechanisms to break a large problem into smaller pieces (grains), providing work for multiple cores in parallel and gaining performance. With 20-core laptops widely available, along with 176-core cloud nodes, parallel acceleration can be substantial, as shown in the C++ Parallel STL Benchmark blog.

However, when processing a small work item, a severe performance drop can occur. An over 100X slowdown for a standard C++ parallel algorithm is demonstrated in the Improving Parallel Performance for Small Arrays blog, which also offers a Parallel Pattern to avoid the drop: by scaling the number of cores to the size of the problem, instead of always using all cores, the substantial performance decrease can be avoided. However, the amount of acceleration obtained from using multiple cores is reduced: for small work items the performance drop is eliminated, but no parallel acceleration results.

In this blog, a second Parallel Pattern is explored, which provides parallel acceleration for small work items in a certain scenario.

Bundling

When a work item is very large, such as sorting a very large array, there is plenty of work for multiple workers, such as multiple cores of a processor. Breaking such a large array into smaller pieces to process in parallel by each available worker is a wonderful way to reduce the time it takes to get the job done.

For example, if a work item is large enough for two workers to work independently to get the job done, then splitting the work item into two pieces will reduce the time to get the job done.

The opposite case is where two work items are too small to process in parallel efficiently, and need to be bundled together to be processed efficiently by a single worker. When many such small work items are to be processed in parallel, this multitude of small work items can be grouped into several bundles, with each bundle processed by a single worker.

Bundling is a natural complement to splitting:

  • Large work items are split into multiple smaller work chunks, where each chunk can be processed efficiently by a worker.
  • Multiple small work items are bundled together to create a large enough work bundle that can be processed efficiently by a worker.

Bundling Example

A good real-world example of bundling is a list of chores requiring going out of the house. Instead of going out to each destination individually, coming back home after completing each chore, we bundle several chores together, making multiple stops before returning home. This is more efficient in time and energy, eliminating the overhead of leaving home and returning home between each chore on the list.

Bundling multiple small work items for a multi-core processor works the same way, eliminating the per-item parallel overhead. The parallel overhead is spread across multiple small work items, raising efficiency and performance and creating the opportunity for parallel acceleration.

C++ Implementation

Implementing bundling in C++ is tricky. Here is one such implementation:

#include <iostream>
#include <tbb/task_group.h>
#include <vector>
#include <functional>

void foo(int arg) {
    std::cout << "arg = " << arg << std::endl;
}

int main() {
    tbb::task_group g;

    // Create a vector of function calls
    std::vector<std::function<void()>> function_bundles;

    // Populate the vector with function calls
    function_bundles.push_back([] { foo(1); });
    function_bundles.push_back([] { foo(2); });
    // Add more function calls as needed

    // Run all function calls as a single task
    g.run([&function_bundles] { 
        for (const auto& func : function_bundles) {
            func(); // Execute the function
        }
    });

    g.wait(); // Wait for all tasks to finish
}

The above implementation was written by Mark Lubin (moderator of Intel’s TBB user forum) in response to my question about TBB supporting bundling of many small work items to be run by a single worker. Let’s analyze this powerful implementation, its benefits, and its similarity to an async pattern.

The above code creates a vector of function calls, which is the bundle of work to be executed by a single core. Two calls to the foo() function are added to the bundle as lambdas. Using lambdas enables bundling not only multiple calls to a single function, but also calls to different functions.

Once the bundle of function calls has been created, it can be executed later. The above implementation creates a single bundle, and can be extended to create multiple bundles for parallel execution. Each bundle would then be launched with a g.run() call, and waiting for all parallel bundles to finish is done with the g.wait() call on the task_group.

Benchmark

The benchmarking setup is 100,000 arrays of 1,000 integers, with each array initialized to the value 2. Various numbers of arrays are bundled together, with each bundle executed by a single core on a 20-core laptop or a 176-core AMD processor node in Azure. Time to create the bundles and to execute them is shown in the Table below for the 176-core AMD node:

Number of Bundles    Bundle Size    Time to Create Bundles    Time to Execute Bundles
        1              100,000              5.0 ms                    19.7 ms
        2               50,000              2.5 ms                    18.4 ms
        4               25,000              3.6 ms                    12.1 ms
        8               12,500              3.7 ms                     9.2 ms
       16                6,250              3.9 ms                     7.5 ms
       32                3,125              4.3 ms                     5.4 ms
       64                1,562              4.7 ms                     4.8 ms
      128                  781              4.9 ms                     4.0 ms
      256                  390              4.9 ms                     3.3 ms

Running all arrays sequentially on a single core, without bundling, takes 18.2 milliseconds. Starting with 4 bundles, processing using bundles is faster than sequential processing. Using 256 bundles results in a total run time of 8.2 ms (4.9 ms to create plus 3.3 ms to execute), which is 2.2X faster than sequential processing.

Results for a smaller problem of 10,000 arrays are shown in the Table below:

Number of Bundles    Bundle Size    Time to Create Bundles    Time to Execute Bundles
        1               10,000              0.5 ms                     4.6 ms
        2                5,000              0.2 ms                     2.0 ms
        4                2,500              0.2 ms                     1.4 ms
        8                1,250              0.2 ms                     0.9 ms
       16                  625              0.2 ms                     0.8 ms
       32                  312              0.2 ms                     0.8 ms
       64                  156              0.2 ms                     0.7 ms
      128                   78              0.2 ms                     0.8 ms
      256                   39              0.3 ms                     0.8 ms

Running all arrays sequentially on a single core, without bundling, takes 4.5 milliseconds. Running all arrays in parallel, without bundling, takes 46,372 ms (i.e. over 46 seconds), which is over 10,000 times slower than sequential (single-core) processing.

Starting with 2 bundles, processing using bundles is faster than sequential processing. Using 16 bundles results in a total run time of 1.0 ms (0.2 ms to create plus 0.8 ms to execute), which is 4.5X faster than sequential processing.

Source Code

Example code for bundling small work items and the performance benchmark code are here.

Performance Optimizations

The example implementation creates several bundles and waits to run them until all bundles have been created. Instead, each bundle could be run as soon as it is created, overlapping execution with creation.

Conclusion

Parallel acceleration for small work items is difficult. Handing each small work item to a core incurs the large overhead of parallelism. If each item is small enough, parallelizing in this way slows things down severely: over a 10,000X performance hit was shown above. In fact, in many cases it is faster to process many small items on a single core than with multiple cores in parallel.

This blog shows a way to bundle multiple small work items for each worker (processor core), providing sufficient work per core to reduce the parallel overhead to a negligible level. This results in parallel acceleration.
