Memory Access

We think of computer system memory as RAM – random access memory. Originally, this meant that accessing any location, chosen at random, took the same amount of time. That is true of certain memory types, such as static RAM (SRAM). In this blog I will show that current computer system memory has deviated significantly from being truly random access, which has significant implications for performance depending on how memory is accessed. We will see just how big the difference is.

All benchmarks were run on my Dell G5 laptop, which uses DDR4 DRAM (dynamic RAM) memory for its system memory.

Read Performance

Using C++ to compute the sum of an array of 500 million 32-bit integers runs at 20 GBytes/sec when the array is accessed sequentially from beginning to end, with a latency of 0.2 nanoseconds per element.

// Sum the array sequentially, from the first element to the last.
inline unsigned long sumDirect(unsigned long* src, unsigned long length)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < length; i++)
        sum += *src++;
    return sum;
}
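
These bandwidth and latency figures come from simple wall-clock timing: bytes transferred divided by seconds gives bandwidth, and seconds divided by element count gives per-element latency. Below is a minimal sketch of such a timing harness – not the exact one used for these measurements:

#include <chrono>
#include <cstdio>
#include <vector>

unsigned long sumDirect(unsigned long* src, unsigned long length);  // shown above

int main()
{
    // Note: unsigned long is 32 bits under MSVC, matching the 32-bit integers
    // measured above; under Linux/gcc it is 64 bits, which doubles bytes moved.
    const unsigned long length = 500000000UL;   // 500 million elements
    std::vector<unsigned long> src(length, 1);

    auto start = std::chrono::steady_clock::now();
    unsigned long sum = sumDirect(src.data(), length);
    auto stop = std::chrono::steady_clock::now();

    double seconds      = std::chrono::duration<double>(stop - start).count();
    double gBytesPerSec = length * sizeof(unsigned long) / seconds / 1e9;
    double nsPerElement = seconds * 1e9 / length;
    printf("sum = %lu, %.2f GBytes/sec, %.3f ns/element\n", sum, gBytesPerSec, nsPerElement);
    return 0;
}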

However, when computing the sum of the same array while accessing it at random locations, performance drops by 44 times, down to 0.45 GBytes/sec, and latency grows by 275 times, to 55 nanoseconds per element.

// Sum the array in the order given by the index array (random order here).
inline unsigned long sumIndirect(unsigned long* srcA, unsigned long length, unsigned long* index)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < length; i++)
        sum += srcA[index[i]];
    return sum;
}
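
The index array is assumed to hold a random permutation of 0 through length-1, so every element is still read exactly once, just in random order. One way to build such a permutation – makeRandomIndex is a helper name I’m using here; the actual benchmark may generate its indices differently:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Build a random permutation of [0, length) to drive sumIndirect.
// Each array element is still visited exactly once; only the order is random.
std::vector<unsigned long> makeRandomIndex(unsigned long length)
{
    std::vector<unsigned long> index(length);
    std::iota(index.begin(), index.end(), 0UL);   // 0, 1, 2, ..., length-1
    std::mt19937_64 gen(42);                      // fixed seed for repeatable runs
    std::shuffle(index.begin(), index.end(), gen);
    return index;
}

Shuffling, rather than drawing independent random indices, guarantees the random-access sum reads exactly the same data as the sequential one, so the two results can be compared for correctness.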

This dramatic reduction in performance is due to the DRAM used for system memory, which over the years has been optimized for sequential access while latency has hardly improved.

Note that computing a sum exercises only memory reads.

Read & Write Performance Summary

Operation  Source      Destination  Bandwidth GBytes/sec  Latency ns
Read       Sequential  -            20                    0.2
Read       Random      -            0.45                  55
Write      -           Sequential   27                    0.15
Write      -           Random       0.075                 53
Copy       Sequential  Sequential   16                    0.26
Copy       Sequential  Random       0.26                  16
Copy       Random      Sequential   0.093                 43
Copy       Random      Random       0.106                 38

The above table shows how dramatically higher performance is when all accesses are sequential. Random access to either the source or the destination, or to both, reduces bandwidth and increases latency; latency grows by between 60X and 350X.

Write Performance

Using C++ to write to memory sequentially runs at 27 GBytes/sec, with latency of 0.15 nanoseconds.

// Fill the destination sequentially with a single value.
inline void writeDirect(unsigned long* dst, unsigned long value, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        *dst++ = value;
}

Writing to random locations runs at 75 MBytes/sec (not GBytes/sec), with latency of 53 nanoseconds.

// Write a single value to destination locations in the order given by the index array.
inline void writeIndirect(unsigned long* dst, unsigned long value, unsigned long* index, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        dst[index[i]] = value;
}

Copy Performance

Copying reads the source array and writes it to a destination array. A sequential C++ copy runs at 16 GBytes/sec, with a latency of 0.26 nanoseconds per element.

// Copy sequentially: read the source in order, write the destination in order.
inline void copyDirect(unsigned long* src, unsigned long* dst, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        *dst++ = *src++;
}

Copying from a source array to random locations within the destination runs at 260 MBytes/sec, with latency of 16 nanoseconds.

// Read the source sequentially, write the destination in random (indexed) order.
inline void copyIndirectDestination(unsigned long* src, unsigned long* dst, unsigned long* index, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        dst[index[i]] = *src++;
}

Copying from random locations within the source array to the destination runs at 93 MBytes/sec, with latency of 43 nanoseconds.

// Read the source in random (indexed) order, write the destination sequentially.
inline void copyIndirectSource(unsigned long* src, unsigned long* dst, unsigned long* index, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        *dst++ = src[index[i]];
}

Copying from random locations within the source array to random locations within the destination array runs at 106 MBytes/sec, with latency of 38 nanoseconds.

// Read the source and write the destination both in random (indexed) order.
inline void copyIndirectSourceAndDestination(unsigned long* src, unsigned long* dst, unsigned long* indexSrc, unsigned long* indexDst, unsigned long length)
{
    for (unsigned long i = 0; i < length; i++)
        dst[indexDst[i]] = src[indexSrc[i]];
}

Is System Memory RAM?

I’m not the first to point out that our RAM is not living up to its name, since it’s much worse at random accesses than sequential ones. Jim Blinn (a researcher at Microsoft) touched on this in his 2003 article “Fugue on MMX”, noting that RAM is supposed to mean that accessing any byte within memory takes the same amount of time, and that if the byte being accessed is not in cache, extra time is needed to bring the containing cache line into cache.

The measurements presented above show that our current memory hierarchy is much better at sequential accesses than random ones. Maybe we should start calling system memory SAM – sequential access memory.

In Cache

Cache memory in most processors is made not of DRAM but of SRAM (static random access memory), which is much better at random access. Cache memory, then, is expected to perform better for random accesses. Sure enough, random accesses are only about 2 times slower than sequential accesses for the L1 and L2 caches, with random access bandwidth around 3.6 GBytes/sec versus sequential access bandwidth around 7.6 GBytes/sec. Random access bandwidth is lower for the L3 cache, at 2.3 GBytes/sec, which is 3.3 times slower than sequential accesses.
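
One way to observe this is to shrink the working set until it fits within each cache level and repeat the same random-order sum over it many times. A rough sketch – the working-set sizes are illustrative, and actual cache capacities vary by processor:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    // Working sets sized to fit in a typical L1, L2, and L3 cache respectively.
    const unsigned long workingSetBytes[] = { 16 * 1024, 128 * 1024, 4 * 1024 * 1024 };
    for (unsigned long bytes : workingSetBytes)
    {
        unsigned long length = bytes / sizeof(unsigned long);
        std::vector<unsigned long> src(length, 1), index(length);
        std::iota(index.begin(), index.end(), 0UL);
        std::mt19937_64 gen(42);
        std::shuffle(index.begin(), index.end(), gen);

        unsigned long sum = 0;
        const unsigned repeats = 1000;  // revisit the same data so it stays cached
        auto start = std::chrono::steady_clock::now();
        for (unsigned r = 0; r < repeats; r++)
            for (unsigned long i = 0; i < length; i++)
                sum += src[index[i]];   // random order within the working set
        auto stop = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(stop - start).count();
        printf("%9lu bytes: %6.2f GBytes/sec (sum = %lu)\n",
               bytes, repeats * (double)bytes / seconds / 1e9, sum);
    }
    return 0;
}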

Implications

This optimization for sequential accesses has implications for the kinds of computations being performed. Any algorithm using sequential memory accesses benefits, while any algorithm using random memory accesses suffers by 25X or more. If an algorithm’s access pattern is data-dependent and the data happens to be random, its performance suffers by more than an order of magnitude, as illustrated below.
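
Pointer-chasing data structures are a classic example: the address of each access depends on the previous load, so the hardware cannot stream ahead, and every cache miss pays the full random-access latency. A hypothetical illustration:

// Linked-list traversal: the address of each node comes out of the previous
// node, so accesses cannot be prefetched ahead. When nodes are scattered
// across memory, each hop pays DRAM latency (~55 ns above) instead of
// streaming at 20 GBytes/sec.
struct Node { Node* next; unsigned long value; };

unsigned long sumList(const Node* head)
{
    unsigned long sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next)
        sum += p->value;  // data-dependent: the next address is unknown until this load completes
    return sum;
}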

It would be amazing if memory designers and computer architects worked on improving memory access latency – that is, random access times. Latency is the hard one to improve, since it means fighting physics, but the benefits would be enormous: all computation would improve in performance, and no algorithm would be left behind.

Even Faster Reads

Accessing memory using C++ with a single core at 20 GBytes/sec is good, but it is not the best we can do. My HPCsharp C# library (free on nuget.org) has a multi-core sum implementation that uses SIMD/SSE parallel instructions along with multiple cores to go even faster. Using this library, computing the sum runs at 32 GBytes/sec while using the 6 cores available in the processor. On a 14-core Xeon processor with 4 memory channels it runs at 54.4 GBytes/sec, which is 13.6 giga-adds/sec for 32-bit integers.
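
HPCsharp is C#, but the underlying idea – split the array across cores and let each core’s inner loop use SIMD – can be sketched in C++ with std::thread. This is a simplified illustration of the approach (sumParallel is my name for it), not the HPCsharp implementation:

#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Split the array across hardware threads; each thread sums its own slice,
// and the per-thread partial sums are combined at the end. With optimization
// enabled, compilers typically vectorize the inner loop with SIMD.
unsigned long sumParallel(const unsigned long* src, unsigned long length)
{
    unsigned numThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<unsigned long> partial(numThreads, 0);
    std::vector<std::thread> threads;

    unsigned long chunk = length / numThreads;
    for (unsigned t = 0; t < numThreads; t++)
    {
        unsigned long begin = t * chunk;
        unsigned long end   = (t == numThreads - 1) ? length : begin + chunk;
        threads.emplace_back([&, t, begin, end] {
            unsigned long s = 0;
            for (unsigned long i = begin; i < end; i++)
                s += src[i];        // each thread streams sequentially through its slice
            partial[t] = s;         // distinct slot per thread: no synchronization needed
        });
    }
    for (auto& th : threads)
        th.join();
    return std::accumulate(partial.begin(), partial.end(), 0UL);
}

A single core cannot keep enough memory requests in flight to saturate all memory channels, which is why adding cores can raise the sequential read rate beyond what one core achieves.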
