Every programming language provides methods for generating random numbers, because random numbers are useful in so many situations. For years, the C programming language provided only the rand() function, which generates 15 random bits per call. Generating a 32-bit integer therefore takes three calls to rand() to accumulate enough bits. On my quad-core, sub-$1,000 laptop, using rand() to generate an array of 500 million integers runs at:
| Function | Number of Bits | MegaIntegers/sec |
| --- | --- | --- |
C++ added a rich, modern set of methods for generating different types of random numbers with a variety of statistical distributions. The good old rand() function is still there for backward compatibility, despite its poor statistical characteristics, but it has been deprecated, replaced by a much richer, more capable set of random generation functions using a variety of algorithms. Let’s check out the capabilities and performance of these functions, generating a uniform distribution (0.0 to 1.0 for floating-point, and -2 billion to 2 billion for 32-bit integers).
C++ also added a standard way to access hardware-based entropy generators, if one is available on the processor. Performance will vary from processor to processor; on my laptop it measures:
- 13 MegaUnsignedIntegers/sec
Intel has developed two powerful accelerated libraries: Integrated Performance Primitives (IPP) and Math Kernel Library (MKL). Both use data-parallel processor instructions to accelerate computation, and both provide accelerated pseudo-random number generator algorithms. MKL algorithms are not themselves multi-threaded, but they are safe to use from multi-threaded code. The following graph shows performance measurements of some of them. Performance was measured with a single channel of system memory, with the algorithms running on a single core:
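A minimal sketch of MKL usage through its VSL interface (create a stream, fill an array in one batch call, tear the stream down); the basic generator and seed chosen here are illustrative, and other basic generators such as VSL_BRNG_SOBOL and VSL_BRNG_NIEDERR can be substituted:

```cpp
#include <mkl_vsl.h>
#include <vector>

int main()
{
    const int n = 1000;
    std::vector<float> r(n);

    VSLStreamStatePtr stream;
    // Create a stream for one of MKL's basic generators (MT19937 here)
    vslNewStream(&stream, VSL_BRNG_MT19937, 42);

    // Fill the whole array in one batch call: uniform floats in [0, 1)
    vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, n, r.data(), 0.0f, 1.0f);

    vslDeleteStream(&stream);
    return 0;
}
```

This batch-oriented interface is what allows MKL to apply data-parallel instructions across the whole array, rather than producing one value per call as rand() does.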
The following graph shows the performance of MKL random number generator algorithms for arrays small enough to fit into the first-level CPU cache (L1), as well as progressively larger arrays that fit into the second-level (L2) and third-level (L3) caches, and finally arrays too big to fit into any cache level but small enough to fit into system memory. The units are MegaFloats/sec.
Note that for some algorithms, such as NIEDERR and SOBOL, performance is significantly higher when the array fits in cache; for the NIEDERR algorithm, performance drops progressively as the array gets bigger. These benchmarks were performed on a laptop with dual-channel DDR4 system memory. Notice the significantly higher performance versus the previous graph, which was measured on the same laptop with a single channel of DDR4 system memory. These algorithms ran on a single core of a quad-core processor. The faster algorithms are limited by system memory bandwidth to slightly under 3 GigaFloats/sec when using arrays larger than cache memory.
Intel’s Integrated Performance Primitives (IPP) library also provides random number generation algorithms, whose performance is shown below:
A Curious Performance Behavior
A curious performance behavior was encountered while benchmarking the generation of the 500 million element array of random numbers used for all of the above measurements. Performance of the MKL routines started out 2X to 4X slower than the performance listed above. This happened only the first time the array of random numbers was generated; subsequent runs reached the above performance. John McCalpin explains this issue perfectly:
“If the original test case did not instantiate the pages for the 500 million element array (e.g., as was done with the memset() call), then the time required for the first MKL call will include the time for the OS to instantiate those pages. This can be a significant overhead — the actual cost depends on the OS, the processor family, the number of processors, etc.”
In other words, allocating a 500 million element array does not mean that the array is actually in system memory. Only once at least one element in each page (whose size is determined by the operating system) has been touched will the entire array be in system memory, ready to be filled by the random number generator. If the random number generator is the first to fill the array, it will incur the performance degradation associated with the OS creating each page of the array in system memory.
It is not fair to include the operating system’s overhead for creating and paging in an array in benchmarks of the random number generation algorithms. For this reason, the benchmarks above show performance with the array already allocated and already resident in system memory. However, as users of a variety of algorithms, we need to be aware of such situations.
While exploring random number generation, I encountered an issue with the C++ standard library on Windows under Visual Studio 2015. When generating a uniform distribution of doubles, setting the range to 0.0 to 1.0 works, but setting the range to 0.0 to DBL_MAX does not produce the requested range. Setting the range to DBL_MIN to DBL_MAX also fails to produce the requested range, and even backing the maximum off to FLT_MAX still produces results outside the requested range.
My Dr. Dobb’s article “Quality of Random Numbers versus Performance” shows the issue with the C++ rand() function statistically and graphically. Zoom in on the images to see patterns, which indicate a lack of randomness.