Professor Blelloch and his team at Carnegie Mellon University have designed and developed a parallel algorithms library over the last decade – ParlayLib. It provides numerous parallel algorithms targeting shared-memory multicore processors. It is similar to Intel’s Threading Building Blocks (TBB), providing a works-stealing scheduler, but also goes beyond with support for additional parallel primitives and algorithms.
ParlayLib implements many of the Parallel STL algorithms. It can also co-exist with TBB, OpenMP, and Cilk. This “Brief Announcement” paper is a wonderful introduction to ParlayLib capabilities.
Benchmark
ParlayLib provides an extensive benchmark suite, using Googles open source benchmarking library, which runs various algorithms multiple times and gathers statistics across them.
Getting Started
ParlayLib is simple to get started, supporting Linux and Windows. I put together a “Getting Started with ParlayLib” repository, which provides directions and shows the steps necessary to get a “Hello World!” ParlayLib application running on Linux and Windows.
The following Table shows performance for two ParlayLib algorithms when sorting an array of 1 billion 32-bit integers on a 12-th generation Intel i7-12700H laptop processor (14-core with hyperthreading):
| Algorithm | Random | Pre-sorted | Constant |
| Parallel Merge Sort | 115 | 394 | 413 |
| Parallel Integer Sort | 943 | 714 | 1720 |
The units of measure in the table are millions of integers/second.
The above measurement was performed on WSL (Ubuntu) on Windows. Parallel Integer Sort performed slower by more than twice slower on Windows, while Parallel Merge Sort performance was nearly the same.
The following Table shows performance for two ParlayLib algorithms when sorting an array of 1 billion unsigned 32-bit integers on a C7i.24xlarge AWS node (48-core with hyperthreading):
| Algorithm | Random | Pre-sorted | Constant |
| Parallel Merge Sort | 512 | 1303 | 1303 |
| Parallel Integer Sort | 3736 | 3048 | 5568 |
The following Table shows performance on a C7i.48xlarge AWS node (96-core Intel CPU with hyperthreading):
| Algorithm | Random | Pre-Sorted | Constant |
| Parallel Merge Sort | 865 | 1405 | 1314 |
| Parallel Integer Sort | 4195 | 3709 | 3831 |
The following Table shows performance on a C7a.48xlarge AWS node (96-core AMD CPU with hyperthreading):
| Algorithm | Random | Pre-Sorted | Constant |
| Parallel Merge Sort | 670 | 1929 | 1907 |
| Parallel Integer Sort | 5382 | 4829 | 10152 |
Azure Standard HB176-144rs v4 (144-core 4-th generation AMD EPYC CPU with 2.3 GByte 3D V-Cache without hyperthreading):
| Algorithm | Random | Pre-Sorted | Constant |
| Parallel Merge Sort | 935 | 1613 | 1554 |
| Parallel Integer Sort | 6725 | 6920 | 9891 |
Azure Standard Standard HB176rs v4 (176-core 4-th generation AMD EPYC CPU with 2.3 GByte 3D V-Cache without hyperthreading):
| Algorithm | Random | Pre-Sorted | Constant |
| Parallel Merge Sort | 1000 | 1691 | 1457 |
| Parallel Integer Sort | 7189 | 7189 | 9756 |
One thought on “ParlayLib Parallel Algorithms Library”