This is a reprint of my Dr. Dobb’s article from May of 2004. It is still relevant today, because memory bandwidth within the memory hierarchy is critical to reaching full performance potential. Optimization often demands diving into system memory and processor cache. Intel architecture CPUs are made of several parts working together to execute […] Read more "Memory Hierarchy Bandwidth"
This blog is a re-post of my Dr. Dobb’s Journal article from February of 2011. All of the source code, including a working Visual Studio 2015 solution with examples, is on GitHub. For the next several articles in this series, we’ll explore parallel and sequential merge algorithms. We’ll utilize Intel Threading Building Blocks as well as […] Read more "Parallel Merge"
This blog is a re-post of my Dr. Dobb’s Journal article from March of 2011. All of the source code, including a working Visual Studio 2015 solution with examples, is on GitHub. In last month’s article in this series, a Parallel Merge algorithm was introduced, and its performance was optimized to the point of being limited by system memory bandwidth. […] Read more "Parallel Merge Sort"
In late 1996 I developed a recursive hardware multiplier and presented it at the Synopsys User Group conference in 1998. I recently ran across the Karatsuba algorithm for fast multiplication, whose recursive application reminded me of my recursive multiplier. I was mainly after increasing performance for fairly small multipliers, ease of pipelining, and not in […] Read more "Recursive Multiplier in VHDL"
In this blog I’ll gather introductory material that is useful when you’re starting out with OpenCL, including links to videos, introductory source code for first projects, and information on how to get Visual Studio set up for OpenCL on Windows. Video introduction to OpenCL is a nice introduction to OpenCL terminology and the overall concepts. It’s an hour […] Read more "OpenCL Introduction"
In my previous blogs, pseudo random number generators (PRNGs) running on a multi-core processor (CPU) or graphics processor (GPU) were shown to have vastly superior performance to those in the standard C++ libraries. For CPU-based generators, using several CPU cores and utilizing parallel instructions within each core paid off. Using hundreds of GPU cores took performance […] Read more "Faster Random Number Generator"