In my previous blogs, pseudo-random number generators (PRNGs) running on a multi-core processor (CPU) or a graphics processor (GPU) were shown to have vastly superior performance to those in the standard C++ libraries. Using several CPU cores, along with parallel instructions within each core, paid off for CPU-based generators. Using hundreds of GPU cores took performance to another level. In this blog, let's explore how to go even faster.
To take full advantage of its multi-core processors, Intel created the Math Kernel Library (MKL), a powerful library of optimized algorithms. MKL includes several PRNG algorithms that scale well across cores for faster computation, as I showed in CPU Random Number Generators.
Nvidia has created several libraries of algorithms optimized to take advantage of the hundreds of graphics cores within its GPU family, for faster computation. One of these is the powerful cuRAND library. GPUs also come with fast dedicated memory offering high bandwidth and large capacity: the GeForce GTX 1070 desktop GPU, for instance, has 8 GB of its own memory and 1920 graphics cores, and gaming laptops now ship with GPUs at this level. I showed the performance of the GPU in GPU Random Number Generators.
Going Even Faster
To go even faster, I developed a library that stands on the shoulders of the Intel and Nvidia giants, taking advantage of the CPU and the GPU simultaneously. By harvesting the compute power of all cores of the CPU and all cores of the GPU working together, running the best algorithm on each of them, an even higher level of performance can be achieved. The following performance measurements show this benefit:
- 3 billion floating-point randoms per second, multi-core CPU alone.
- 6 billion floating-point randoms per second, GPU alone.
- 8 to 9 billion floating-point randoms per second, CPU and GPU working together.
By using more of the computational resources within the computer, a performance gain of 30 to 50% over the GPU alone is possible.
A Few Gory Details
The above performance levels were achieved on my laptop, which has an Intel Core i5-6300HQ CPU with 16 GB of DDR4, and a GeForce GTX 950M GPU with 2 GB of GDDR5. The floating-point random numbers were uniformly distributed between 0.0 and 1.0. The randoms were generated in GPU memory and in CPU memory simultaneously. My laptop was plugged in when running all benchmarks.
The above performance gain fails to show up when the laptop runs on battery power. On battery, the CPU alone still runs at its maximum performance, and the same goes for the GPU when it runs alone. However, when the CPU and GPU run together, performance drops below that of the GPU alone. It could be that the CPU and GPU clocks get throttled to keep the power draw at levels the battery can safely sustain.
I verified this behavior by looking at the CPU clock frequency while running these benchmarks. When the laptop was plugged in, the CPU clock was running at 3 GHz during the benchmark. However, when the laptop was running on battery power, the CPU clock varied between 2 and 2.5 GHz. This shows the laptop is actively slowing the CPU down to manage the power draw.
I also monitored the GPU core clock and its memory clock. However, these were not throttled down by the laptop during the benchmark. It seems that during battery operation, the GPU is allowed to run at full speed, while the CPU clock is throttled to manage the total power draw.