In my previous blogs, accelerating performance via parallelism worked well for large problems, but for small problems it slowed performance down significantly. A solution to this dilemma was suggested: apply 1 core for the smallest problems, 2 cores for larger problems, and so on, scaling the number of cores to the problem size and avoiding “too many cooks in the kitchen”. This moved us closer to “perform no worse” when using parallel algorithms compared to sequential ones. In this blog, let’s explore deeper to see whether this is a fundamental method for scaling parallel performance.
Atom of Work
An atom of work is the smallest unit of work that can be executed efficiently in parallel. It is not the smallest amount of work possible, but the smallest amount that is worthwhile to run in parallel. For instance, Intel Threading Building Blocks (TBB) provides the following recommendation for the grain size (atom of work):
“TBB tasks should be on average greater than 1 microsecond to effectively hide the overheads of work stealing. This translates to several thousand CPU cycles – if you prefer using cycles, we suggest a 10,000 cycle rule of thumb.” – Pro TBB book
When an array is being worked on by a parallel algorithm, the size of an atom of work depends on the algorithm and the data in the array. Some algorithms perform many computations per element, others perform few. Thus, the number of array elements that makes up an atom of work varies from algorithm to algorithm, and by data type.
An atom of work can be several thousand elements of an array or as few as one.
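To make this concrete, here is a minimal sketch of how an atom of work maps onto TBB’s grain size. The 20,000-element grain size and the scaleArray operation are illustrative assumptions for this example, not measured values:

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>
#include <cstddef>
#include <vector>

// Assumed atom of work for this particular operation: roughly 20,000 elements.
constexpr std::size_t kGrainSize = 20'000;

void scaleArray(std::vector<float>& data, float factor)
{
    // With simple_partitioner, TBB splits the range until each chunk is no larger
    // than the grain size (and no smaller than roughly half of it), so each task
    // gets approximately one atom of work.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, data.size(), kGrainSize),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= factor;
        },
        tbb::simple_partitioner());
}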
Parallel Acceleration (Speedup)
Parallel acceleration (speedup) is how much faster a parallel implementation is than its sequential equivalent. It is the ratio of the time the sequential implementation takes to run divided by the time the parallel implementation takes. For instance, when a parallel implementation is 5 times faster, the speedup (parallel acceleration) is 5.
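Measuring this ratio is straightforward. Below is a minimal sketch, assuming a C++17 toolchain with parallel algorithm support; the 100-million-element sum is just a placeholder workload:

#include <chrono>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> data(100'000'000, 1.0);   // placeholder workload

    // Time an arbitrary piece of work in seconds.
    auto time = [](auto&& work) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(stop - start).count();
    };

    double sum = 0.0;
    double sequentialTime = time([&] { sum = std::reduce(std::execution::seq, data.begin(), data.end()); });
    double parallelTime   = time([&] { sum = std::reduce(std::execution::par, data.begin(), data.end()); });

    // Speedup = TimeForSequentialExecution / TimeForParallelExecution
    std::cout << "sum = " << sum << ", speedup = " << sequentialTime / parallelTime << '\n';
}

For a small enough array, the same ratio can drop below 1, which is exactly the “too many cooks” effect described above.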
When a single atom of work is available, no parallel acceleration is possible, since only a single worker can work on an atom of work. There is only enough work for a single worker. Thus, the best possible parallel acceleration (speedup) is 1.
When two atoms of work are available, either one or two workers can be applied, resulting in a best possible speedup of 2. When three atoms of work are available, up to three workers can be applied, resulting in a best possible speedup of 3. This pattern continues as the amount of work increases, enabling more workers to be applied in parallel and leading to a potential increase in parallel acceleration (speedup).
Thus, parallel acceleration (speedup) is possible at small scale, with smaller problems, which contain fewer work atoms, offering less speedup than larger problems containing more work atoms.
When the problem is small, with fewer work atoms than parallel workers, the maximum possible speedup equals the number of work atoms. Only when the number of work atoms reaches or exceeds the number of parallel workers can all available workers be utilized in parallel.
Not All Cases Accelerate Fully
The size of an atom of work limits how much speedup is possible for small arrays. Speedup is described by the following equation:
Speedup = TimeForSequentialExecution / TimeForParallelExecution
For small arrays, less parallelism is available, leading to less potential speedup: the speedup will be smaller than the number of available workers. Once the array is large enough, all available workers (such as cores) can be utilized efficiently. For an atom of work of 20K array elements, as used in the previous blog, an array of 20K elements or fewer should use one worker (core), providing a speedup of 1. An array of 20K to 40K elements should use two cores, providing a speedup of up to 2. This pattern continues: for a 20-core processor, an array of 400K elements or larger would utilize all of the cores efficiently.
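The sketch below expresses this scaling rule in code, assuming the 20K-element atom of work and the hypothetical 20-core processor from the example above:

#include <algorithm>
#include <cstddef>
#include <iostream>

constexpr std::size_t kAtomSize = 20'000;   // assumed atom of work, in array elements
constexpr std::size_t kNumCores = 20;       // assumed core count

// Scale the number of workers to the problem size: one core per atom of work,
// capped at the number of cores available.
std::size_t workersToUse(std::size_t arraySize)
{
    std::size_t atoms = (arraySize + kAtomSize - 1) / kAtomSize;   // ceiling division
    return std::max<std::size_t>(1, std::min(atoms, kNumCores));
}

int main()
{
    for (std::size_t n : { 20'000u, 40'000u, 400'000u, 4'000'000u })
        std::cout << n << " elements -> " << workersToUse(n) << " core(s)\n";
    // 20K -> 1, 40K -> 2, 400K -> 20, and anything larger stays capped at 20.
}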
The following equation shows the minimum array size that can be fully accelerated by W workers:
MinimumFullyAcceleratedArraySize = W * SizeOfWorkAtom
For example, for a processor with 100 cores (workers) and a 20K atom size, the array has to contain at least 2 million elements to be able to use all of the cores.
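The same arithmetic as a tiny sketch, using the hypothetical 100-core processor and 20K atom size from the example:

#include <cstddef>
#include <iostream>

// MinimumFullyAcceleratedArraySize = W * SizeOfWorkAtom
std::size_t minimumFullyAcceleratedArraySize(std::size_t workers, std::size_t atomSize)
{
    return workers * atomSize;
}

int main()
{
    std::cout << minimumFullyAcceleratedArraySize(100, 20'000) << '\n';   // prints 2000000
}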
Size of Atom of Work Limits Parallelism
For small arrays, the maximum available speedup is limited by the following equation, capped at the number of cores (C):
MaximumAvailableSpeedup = Min( C, ArraySize / SizeOfAtomOfWork )
The size of the atom of work limits the smallest array size that can utilize all of the cores efficiently. If the atom of work could be reduced by an order of magnitude (10X), to 2K array elements, then an array of only 200K elements or larger would utilize all cores of a 100-core processor. The smaller the atom of work, the smaller the array at which all cores of a processor can be utilized efficiently. The current, fairly large atom of work sizes of TBB and PPL for C++ and TPL for C# limit parallel performance for small arrays.
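To see how much the atom size matters, here is a small sketch evaluating the equation above for a hypothetical 100-core processor and a 200K-element array, at the current roughly 20K-element atom size versus a 10X smaller one:

#include <algorithm>
#include <cstddef>
#include <iostream>

// MaximumAvailableSpeedup = Min( C, ArraySize / SizeOfAtomOfWork )
std::size_t maximumAvailableSpeedup(std::size_t cores, std::size_t arraySize, std::size_t atomSize)
{
    // Clamp to at least 1: an array smaller than one atom still runs on one core.
    return std::min(cores, std::max<std::size_t>(1, arraySize / atomSize));
}

int main()
{
    std::cout << maximumAvailableSpeedup(100, 200'000, 20'000) << '\n';   // 10 with a 20K-element atom
    std::cout << maximumAvailableSpeedup(100, 200'000,  2'000) << '\n';   // 100 with a 2K-element atom
}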
If the overhead could be reduced significantly, smaller problems would enjoy acceleration from more cores. This would enable more applications to benefit from parallelism, as not all applications process large arrays, but many have access to multi-core processors.