We think of computer memory as RAM – random access memory – meaning that any location can be accessed with nearly the same latency. This is true of certain kinds of memory, such as static RAM (SRAM). As we shall see in this blog, when system memory is made of DRAM (dynamic RAM), how memory is accessed makes a dramatic performance difference.
When DRAM system memory, such as the DDR4 in my Dell G5 laptop, is accessed sequentially, performance is at its maximum. For example, computing the sum of an array of 200 million 32-bit integers runs at 7 GigaBytes/sec when accessing the array sequentially from beginning to end. Computing the sum of the same array while accessing it at random locations, however, drops performance by 25 times, down to 0.275 GigaBytes/sec.
This dramatic reduction in performance is inherent to DRAM, whose sequential access bandwidth has been optimized over the years while its latency has hardly improved.
Accessing memory at 7 GigaBytes/sec is good, but it is not the best we can do. My HPCsharp library (free on nuget.org) has a multi-core sum implementation which uses SIMD/SSE parallel instructions along with multiple cores to go even faster. Using this library, the sum runs at 32 GigaBytes/sec on the 6 cores available in my processor, which is substantially faster. On a 14-core Xeon processor with 4 memory channels, it runs at 54.4 GigaBytes/sec, which is 13.6 GigaAdds/sec.
Is System Memory RAM?
I’m not the first to point out that our RAM is not living up to its name, since it’s much worse at random accesses than at sequential ones. Jim Blinn (a researcher at Microsoft) noted this to some extent in his 2003 article “Fugue for MMX”: RAM is supposed to mean that accessing any byte of memory takes the same amount of time, yet if the byte being accessed is not in cache, it takes extra time to bring a whole cache line into cache.
The measurements presented above show that our current memory hierarchy is much better at sequential accesses than random ones. Maybe we should start calling system memory SAM – sequential access memory – since that is what it’s actually good at.
Cache memory in most processors is made not of DRAM but of SRAM (static random access memory), which is much better at random access. Cache memory, then, is expected to perform better for random accesses. Sure enough, for the L1 and L2 caches, random accesses are only about 2 times worse than sequential accesses, with random access bandwidth around 3.6 GigaBytes/sec and sequential access bandwidth around 7.6 GigaBytes/sec. Random access bandwidth is lower for the L3 cache, at 2.3 GigaBytes/sec, which is 3.3 times slower than sequential access.
This optimization for sequential accesses has implications for certain kinds of computations. Any algorithm using sequential memory accesses benefits, while any algorithm using random memory accesses suffers by 25X or more. If the access pattern is data-dependent and the data happens to be random, performance suffers by more than an order of magnitude.
It would be really amazing if memory designers and computer architects worked on improving memory access latency – that is, random access times. This is the hard one to improve, as it’s hard to fight physics. The benefits, however, would be enormous: all computation would improve in performance, and no algorithm would be left behind.