ParlayLib Parallel Algorithms Library

Professor Blelloch and his team at Carnegie Mellon University have designed and developed a parallel algorithms library over the last decade – ParlayLib. It provides numerous parallel algorithms targeting shared-memory multicore processors. It is similar to Intel’s Threading Building Blocks (TBB), providing a works-stealing scheduler, but also goes beyond with support for additional parallel primitives and algorithms.

ParlayLib implements many of the Parallel STL algorithms. It can also co-exist with TBB, OpenMP, and Cilk. This “Brief Announcement” paper is a wonderful introduction to ParlayLib capabilities.

Benchmark

ParlayLib provides an extensive benchmark suite, using Googles open source benchmarking library, which runs various algorithms multiple times and gathers statistics across them.

Getting Started

ParlayLib is simple to get started, supporting Linux and Windows. I put together a “Getting Started with ParlayLib” repository, which provides directions and shows the steps necessary to get a “Hello World!” ParlayLib application running on Linux and Windows.

The following Table shows performance for two ParlayLib algorithms when sorting an array of 1 billion 32-bit integers on a 12-th generation Intel i7-12700H laptop processor (14-core with hyperthreading):

Algorithm	Random	Pre-sorted	Constant
Parallel Merge Sort	115	394	413
Parallel Integer Sort	943	714	1720

The units of measure in the table are millions of integers/second.

The above measurement was performed on WSL (Ubuntu) on Windows. Parallel Integer Sort performed slower by more than twice slower on Windows, while Parallel Merge Sort performance was nearly the same.

The following Table shows performance for two ParlayLib algorithms when sorting an array of 1 billion unsigned 32-bit integers on a C7i.24xlarge AWS node (48-core with hyperthreading):

Algorithm	Random	Pre-sorted	Constant
Parallel Merge Sort	512	1303	1303
Parallel Integer Sort	3736	3048	5568

The following Table shows performance on a C7i.48xlarge AWS node (96-core Intel CPU with hyperthreading):

Algorithm	Random	Pre-Sorted	Constant
Parallel Merge Sort	865	1405	1314
Parallel Integer Sort	4195	3709	3831

The following Table shows performance on a C7a.48xlarge AWS node (96-core AMD CPU with hyperthreading):

Algorithm	Random	Pre-Sorted	Constant
Parallel Merge Sort	670	1929	1907
Parallel Integer Sort	5382	4829	10152

Azure Standard HB176-144rs v4 (144-core 4-th generation AMD EPYC CPU with 2.3 GByte 3D V-Cache without hyperthreading):

Algorithm	Random	Pre-Sorted	Constant
Parallel Merge Sort	935	1613	1554
Parallel Integer Sort	6725	6920	9891

Azure Standard Standard HB176rs v4 (176-core 4-th generation AMD EPYC CPU with 2.3 GByte 3D V-Cache without hyperthreading):

Algorithm	Random	Pre-Sorted	Constant
Parallel Merge Sort	1000	1691	1457
Parallel Integer Sort	7189	7189	9756

Other Resources

Overview video

Algorithm Performance

Measure, Question, Improve, Do It Again…

ParlayLib Parallel Algorithms Library

Benchmark

Getting Started

Other Resources

One thought on “ParlayLib Parallel Algorithms Library”

Leave a comment Cancel reply

Benchmark

Getting Started

Other Resources

Share this:

Related

One thought on “ParlayLib Parallel Algorithms Library”

Leave a comment Cancel reply