ParlayLib Parallel Algorithms Library

Professor Blelloch and his team at Carnegie Mellon University have designed and developed ParlayLib, a parallel algorithms library, over the last decade. It provides numerous parallel algorithms targeting shared-memory multicore processors. It is similar to Intel’s Threading Building Blocks (TBB) in providing a work-stealing scheduler, but goes beyond it with support for additional parallel primitives […]

Sorting 19X Faster Than C++ Parallel Sort

In my previous blog, Standard C++ Sort was benchmarked running on a single core of an Intel processor at 11 million 32-bit integers per second. Its parallel version scaled up to 93 million integers per second on a 48-core Xeon processor AWS node (C5.24xlarge), an 8X speedup. Also, my implementation of Parallel Merge Sort […]

Improving Parallel Performance for Small Arrays

In my previous blogs, Standard C++ Parallel Algorithms were shown to accelerate well on multi-core processors for large arrays, but to slow performance down for small arrays. In this blog, let’s explore a way to increase performance for small arrays. Data Parallelism: Many Standard C++ Algorithms support several execution modes. The C++ standard supports four modes: seq, […]

Parallel Acceleration at Small Scale

In my previous blogs, performance acceleration via parallelism worked well for large problems, but slowed performance down significantly for small problems. A solution to this dilemma was suggested: apply 1 core to the smallest problems, 2 cores to larger problems, and so on, scaling the number of cores to the problem size – avoiding […]
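The idea of scaling the number of cores to the problem size can be sketched roughly as follows. This is a minimal illustration, not the implementation from the blog: the helper name `cores_for_problem`, the `work_per_core` threshold, and the linear scaling rule are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: choose how many cores to apply to a problem of n elements.
// work_per_core is an assumed tuning threshold (how many elements justify adding
// one more core); it would have to be measured per algorithm and per machine.
std::size_t cores_for_problem(std::size_t n,
                              std::size_t work_per_core,
                              std::size_t max_cores)
{
    assert(work_per_core > 0 && max_cores > 0);

    // Ceiling division without overflow: 1 core up to work_per_core elements,
    // 2 cores up to 2 * work_per_core elements, and so on.
    std::size_t wanted = n / work_per_core + (n % work_per_core != 0 ? 1 : 0);

    // Always use at least 1 core and never more than the machine offers.
    return std::max<std::size_t>(1, std::min(wanted, max_cores));
}
```

A caller would pass the result to whatever mechanism limits parallelism in its runtime (a thread count, an arena size, or a chunk count). The linear rule is only one possible mapping from problem size to core count.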

C++ Parallel STL on GPUs

Under Construction… Nvidia has added standard C++ parallel algorithms on GPUs. Algorithm seq unseq par par_unseq GPUSpeedup max_element(std:: 1600 1613 1620 1581 1.0 adjacent_difference(std:: 2052 2062 996 0.5 adjacent_find(std:: 2963 2947 37 all_of(std:: 3652 3752 34 any_of(std:: 3652 3584 37 count(std:: 2999 2987 1627 equal(std:: 3839 3716 37 copy(std:: 4421 4525 1529 merge(std:: 201 197 […]

Can C++ Parallel Standard Algorithms Accelerate, Even Small Arrays?

My previous blog, C++ Parallel STL Benchmark, showed that the performance of all measured C++ Parallel Standard algorithms increased over sequential single-core implementations. Some algorithms scaled much better than others, by nearly 10X on a 14-core processor and over 20X on a 48-core processor. Only large arrays with 100 million integers were used for these benchmarks. Let’s […]

C++ Parallel STL Benchmark

C++ includes a standard set of generic algorithms, called STL (Standard Template Library). On Windows, Microsoft provides parallel versions of these algorithms, listed below with the first argument being std::. Also, Intel provides parallel implementations, listed below with the first argument being dpl::. These parallel versions utilize multiple cores of the processor, providing substantially faster […]

Practical Parallel Algorithms Book Additional Resources

This page provides additional resources, such as corrections, additions, errata, and updates to the book. Contact Information: vduvanenko2@gmail.com has been set up for correspondence about the book. Don’t hesitate to e-mail a note with questions, suggestions, corrections, improvements, or missing information. Updates: Benchmarks on 12th-generation laptop (14-core) and 4th-generation Xeon AWS node (48-core C7i.24xlarge) […]

Maximum Read Bandwidth

This blog explores methods to reach maximum read bandwidth. Reading memory is a useful basic operation that limits the performance of many algorithms, serial or parallel. Knowing how to reach the maximum available read bandwidth is beneficial in many instances. One way to test performance of memory reading is to implement a Summation. Summing elements of an […]
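A single-core version of such a summation test might look like the sketch below. It is a simplified illustration rather than the benchmark used in the blog: the function names `sum_array` and `read_bandwidth_gb_per_s` are hypothetical, and the element type (32-bit unsigned integers) is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Sum every element; each element is read from memory exactly once.
std::uint64_t sum_array(const std::vector<std::uint32_t>& a)
{
    return std::accumulate(a.begin(), a.end(), std::uint64_t{0});
}

// Hypothetical helper: time one pass over the array and report GB/s read.
// The sum is returned through sum_out so the compiler cannot discard the loop.
double read_bandwidth_gb_per_s(const std::vector<std::uint32_t>& a,
                               std::uint64_t& sum_out)
{
    assert(!a.empty());  // an empty array has no bandwidth to measure

    auto start = std::chrono::steady_clock::now();
    sum_out = sum_array(a);
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    double bytes = static_cast<double>(a.size()) * sizeof(std::uint32_t);
    return bytes / seconds / 1e9;
}
```

For a meaningful measurement the array should be much larger than the last-level cache, and the timing should be repeated several times; very small arrays can finish faster than the clock resolution.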

When to Trust Chip Synthesis

Before synthesis, chip designers used to implement all aspects of chip design by hand. This included combinational logic implementation, storage elements such as SRAMs, flip-flops and latches. Netlist as well as place-and-route were also done fairly manually. High-level design implementations, either in C, Verilog or another high-level language, were manually translated into gates. When synthesis […]

Algorithm Performance

Measure, Question, Improve, Do It Again…