Memory Access

Computer system memory is commonly referred to as RAM – Random Access Memory. Originally, this meant that accessing any location took the same amount of time, which is true of certain memory types, such as Static RAM (SRAM). In this section we’ll discuss how current computer system memory has deviated significantly from being truly random access. This has considerable implications for performance, depending on how memory is accessed, as we’ll see from measurements of latency and bandwidth for various access patterns.

All benchmarks were run on a Dell G5 laptop with an Intel i7-9750H processor, which uses DDR4 DRAM (dynamic RAM) for its system memory.

Read Performance

Using C++ to compute the sum of an array of 500 Million 32-bit integers runs at 20 GBytes/sec when the array is accessed sequentially from beginning to end, with a latency of 0.2 nanoseconds.

// Sequential read: sum the array from its first element to its last
inline unsigned long sumDirect(unsigned long* src, unsigned long length)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < length; i++)
        sum += *src++;
    return sum;
}
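
As a rough sketch of how such a measurement can be taken, the loop above can be timed with std::chrono and the byte count divided by the elapsed time. The array size and clock choice here are assumptions for illustration; this is not the exact harness used to produce the numbers above.

#include <chrono>
#include <cstdio>
#include <vector>

// Times sumDirect (defined above) and reports bandwidth and per-element latency
void benchmarkSumDirect()
{
    const unsigned long length = 500000000UL;    // 500 Million elements
    std::vector<unsigned long> src(length, 1);

    auto start = std::chrono::high_resolution_clock::now();
    unsigned long sum = sumDirect(src.data(), length);
    auto stop  = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    double bytes   = double(length) * sizeof(unsigned long);
    printf("sum = %lu   %.2f GBytes/sec   %.3f ns/element\n",
           sum, bytes / seconds / 1.0e9, seconds / double(length) * 1.0e9);
}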

However, when computing a sum of the same array while accessing it at random locations, bandwidth drops by 44 times, down to 0.45 GBytes/sec, and latency grows by 275 times, to 55 nanoseconds. Random access of the array is accomplished by the following C++ code, with the index array populated with random indexes – random integers between 0 and (length – 1); one way to generate such an index array is sketched after the code:

// Random read: sum the array elements in the order given by the index array
inline unsigned long sumIndirect(unsigned long* srcA, unsigned long length, unsigned long* index)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < length; i++)
        sum += srcA[index[i]];
    return sum;
}
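
The index array can be generated in more than one way: a random permutation visits every element exactly once, while drawing independent random integers in [0, length – 1] may repeat or skip elements. The sketch below uses a permutation via std::shuffle; the generator and seed are assumptions for illustration, not necessarily what was used for the measurements above.

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Builds a random index array: fill with 0, 1, ..., length-1, then shuffle
// into a random permutation, so every array element is visited exactly once
std::vector<unsigned long> makeRandomIndex(unsigned long length)
{
    std::vector<unsigned long> index(length);
    std::iota(index.begin(), index.end(), 0UL);
    std::mt19937_64 rng(42);    // fixed seed, for repeatable runs
    std::shuffle(index.begin(), index.end(), rng);
    return index;
}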

This dramatic reduction in performance is due to the DRAM used for system memory, which has been optimized for sequential access bandwidth over decades, while latency has hardly improved.

Note that computing a sum uses only memory reads; writes and copies are measured below.

Read & Write Performance Summary

Operation | Source access | Destination access | Bandwidth (GBytes/sec) | Latency (ns)
Read      | Sequential    | –                  | 20                     | 0.2
Read      | Random        | –                  | 0.45                   | 55
Write     | –             | Sequential         | 27                     | 0.15
Write     | –             | Random             | 0.075                  | 53
Copy      | Sequential    | Sequential         | 16                     | 0.26
Copy      | Sequential    | Random             | 0.26                   | 16
Copy      | Random        | Sequential         | 0.093                  | 43
Copy      | Random        | Random             | 0.106                  | 38

The above table shows how dramatically higher performance is when both the source and the destination are accessed sequentially. Random access of the source, the destination, or both reduces bandwidth substantially, while latency grows by between 60X and 350X.

Write Performance

Using C++ to write to memory sequentially runs at 27 GBytes/sec, with a latency of 0.15 nanoseconds.

// Sequential write: fill the destination array from front to back
inline void writeDirect(unsigned long* dst, unsigned long value, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        *dst++ = value;
}

Writing to random locations runs at 75 MBytes/sec (not GBytes/sec), with a latency of 53 nanoseconds.

// Random write: store the value at the random locations given by the index array
inline void writeIndirect(unsigned long* dst, unsigned long value, unsigned long* index, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        dst[index[i]] = value;
}

Copy Performance

Copying reads the source array and writes its contents to a destination array. Sequential copying in C++ runs at 16 GBytes/sec, with a latency of 0.26 nanoseconds.

// Sequential copy: read the source and write the destination front to back
inline void copyDirect(unsigned long* src, unsigned long* dst, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        *dst++ = *src++;
}

Copying from a sequential source array to random locations within the destination array runs at 260 MBytes/sec, with a latency of 16 nanoseconds.

// Copy from a sequential source to random locations in the destination
inline void copyIndirectDestination(unsigned long* src, unsigned long* dst, unsigned long* index, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        dst[index[i]] = *src++;
}

Copying from random locations within the source array to a sequential destination runs at 93 MBytes/sec, with a latency of 43 nanoseconds.

// Copy from random locations in the source to a sequential destination
inline void copyIndirectSource(unsigned long* src, unsigned long* dst, unsigned long* index, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        *dst++ = src[index[i]];
}

Copying from random locations within the source array to random locations within the destination array runs at 106 MBytes/sec, with a latency of 38 nanoseconds.

// Copy from random locations in the source to random locations in the destination
inline void copyIndirectSourceAndDestination(unsigned long* src, unsigned long* dst, unsigned long* indexSrc, unsigned long* indexDst, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        dst[indexDst[i]] = src[indexSrc[i]];
}

Is System Memory RAM?

I’m not the first to point out that our RAM is not living up to its name, since it’s much worse at random accesses than at sequential accesses. Jim Blinn (a researcher at Microsoft) pointed this out in his 2003 article “Fugue for MMX” – asserting that RAM is supposed to mean that accessing any byte within memory takes the same amount of time. He pointed out that if a byte being accessed is not in cache, it takes extra time to bring the containing cache line into cache.

The measurements presented above show that our current memory hierarchy is much better at sequential accesses than at random accesses. Maybe we should start calling system memory SAM – Sequential Access Memory.

In Cache

Cache memory in most processors is made of SRAM (static random access memory), which handles random accesses much better than DRAM, so caches are expected to perform better for random accesses. Sure enough, for the L1 and L2 caches random accesses are only about 2 times slower than sequential accesses, with random access bandwidth around 3.6 GBytes/sec versus sequential access bandwidth around 7.6 GBytes/sec. Random access bandwidth is lower for the L3 cache, at 2.3 GBytes/sec, which is 3.3 times slower than sequential access.
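
One way to isolate the behavior of a particular cache level is to size the array to fit within that level and sweep it repeatedly, so that after the first pass the data stays cache-resident. The sketch below assumes a 32 KByte L1 data cache for the sizing example; actual capacities vary by processor, and this is an illustration of the approach rather than the exact benchmark used.

// Sums a small array many times; after the first pass the array stays
// resident in cache, so subsequent passes measure cache (not DRAM) speed
inline unsigned long sumRepeated(unsigned long* src, unsigned long length, unsigned long passes)
{
    unsigned long sum = 0;
    for (unsigned long p = 0; p < passes; p++)
        for (unsigned long i = 0; i < length; i++)
            sum += src[i];
    return sum;
}

// e.g. length = 32768 / sizeof(unsigned long) keeps the array within a 32 KByte L1 data cache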

Memory Hierarchy

Computer architecture places caches closer to the computational cores than system memory, with all memory accesses going through the cache. Assuming a 95% cache hit rate, the overall average latency would be 0.95 * 0.2 ns + 0.05 * 55 ns = 2.94 ns. However, if an algorithm accesses memory in a pattern closer to random than sequential, the cache hit rate will be much lower than 95%. For instance, at a 25% cache hit rate, the overall average latency would be 0.25 * 0.2 ns + 0.75 * 55 ns = 41.3 ns. This is more than 10 times slower.
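
This weighted average is simply the two latencies blended by the hit rate, and can be captured in a small helper; the 0.2 ns and 55 ns figures below are the sequential and random latencies measured above.

// Average latency of a two-level hierarchy: hits are served at cache speed,
// misses at memory speed
inline double effectiveLatencyNs(double hitRate, double cacheNs, double memoryNs)
{
    return hitRate * cacheNs + (1.0 - hitRate) * memoryNs;
}

// effectiveLatencyNs(0.95, 0.2, 55.0) returns 2.94 ns
// effectiveLatencyNs(0.25, 0.2, 55.0) returns 41.3 ns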

Implications

The optimization of computer system memory for sequential access has implications for certain types of computations. Any algorithm that accesses memory sequentially benefits substantially, while algorithms using random memory accesses suffer by 25X or more. If an algorithm is data-dependent and the data happens to be random, its performance suffers by more than an order of magnitude.

It would be beneficial for memory designers and computer architects to work on improving memory access latency – that is, random access time. This is the most difficult metric to improve, as it’s hard to fight physics, but it may be possible by trading off latency against system memory capacity. The benefit would be enormous: all computation would improve in performance, no matter the memory access pattern, and no algorithm would be left behind.

Even Faster Reads

Accessing memory using C++ with a single core at 20 GBytes/sec is good, but it is not the best we can do. My HPCsharp C# library (free and open-source on nuget.org) has a multi-core sum implementation which uses SIMD/SSE instructions and multiple cores to go even faster. Using this library to compute the sum runs at 32 GBytes/sec, using the 6 cores of this processor and its two memory channels. On a 14-core Xeon processor with 4 memory channels, it runs at 54.4 GBytes/sec.
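
HPCsharp itself is C#, but the core idea – splitting the array across cores, summing each piece with the fast sequential kernel, and combining the per-core partial sums – can be sketched in C++ with std::thread. This is an illustration of the technique, not the HPCsharp implementation, and it omits the SIMD/SSE inner loop.

#include <thread>
#include <vector>

// Splits the array across numThreads threads, sums each piece with
// sumDirect (defined above), then combines the per-thread partial sums
unsigned long sumParallel(unsigned long* src, unsigned long length, unsigned numThreads)
{
    std::vector<unsigned long> partial(numThreads, 0);
    std::vector<std::thread> threads;
    unsigned long chunk = length / numThreads;
    for (unsigned t = 0; t < numThreads; t++)
    {
        unsigned long begin = t * chunk;
        unsigned long end   = (t == numThreads - 1) ? length : begin + chunk;
        threads.emplace_back([=, &partial] { partial[t] = sumDirect(src + begin, end - begin); });
    }
    for (auto& th : threads)
        th.join();
    unsigned long total = 0;
    for (unsigned long p : partial)
        total += p;
    return total;
}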
