Memory Hierarchy Bandwidth

This is a reprint of my Dr. Dobb’s article from May 2004. It is still relevant today: memory bandwidth within the memory hierarchy remains critical to reaching full performance potential.
Intel architecture CPUs are made of several parts working together to execute code: execution units that actually do the computation, each with its own specialization. One unit is dedicated to integer computation, another to floating-point computation, and another executes specialized instructions that operate on multiple pieces of data simultaneously. The Pentium 4, for example, has two double-speed scalar integer execution units, one normal-speed scalar integer execution unit, a floating-point unit, and memory load/store execution units. Several of these units can, and do, run concurrently.

Furthermore, most Intel architecture (IA) CPUs also have two levels of cache that hold the most recently used data and instructions, in the statistical hope that data near the current access point will also be accessed soon. The Pentium 4 level-one cache is 8 KB, much smaller than the 512-KB level-two cache. The CPU looks for data in the level-one (L1) cache first; if it does not find it there, it looks in the level-two (L2) cache, and failing that, in main memory. The Pentium 4 also has a separate cache for instructions.

MMX, SIMD, SSE, SSE2

Before MMX, IA CPUs operated on a single piece of data at a time; see Listing One(a). On the Pentium 4, if these values are in CPU registers, each of these additions takes 1/2 of a clock cycle to execute. This is amazing: addition runs at twice the clock frequency. In other words, on a 3-GHz Pentium 4, addition runs at 6 GHz, a peak rate of 6 Giga additions per second! Computations that operate on a single piece of data at a time are generally referred to as “scalar operations.”

Starting with a special version of the Pentium, Intel added specialized instructions for processing multiple pieces of data simultaneously. These instructions fit into a category of operations generally called “Single Instruction Multiple Data” (SIMD): a single instruction operates on multiple data items. These instructions are good at working on four, eight, or even sixteen pieces of data simultaneously, and are efficient when you have lots of data to process, such as images or large arrays. Intel called these instructions “MultiMedia eXtensions” (MMX); see Listing One(b).

The strange types Iu8vec8, Iu16vec4, and Iu32vec2 manipulated in Listing One(b) are classes that Intel defines and delivers with its C++ compiler. Using these predefined types makes programming with MMX/SIMD instructions simpler than coding in assembly language. All of these types are 64 bits in total width, which is the width of the MMX integer registers.

In the first line of Listing One(b), eight additions are performed simultaneously (each of these eight additions adds two 8-bit values together). In the second line, four additions are performed simultaneously, and in the last line, two. On the Pentium 4, if these values are in CPU registers, each of these MMX additions takes 1 clock cycle to execute. Thus, on a 3-GHz Pentium 4, MMX additions run at the following peak rates:

  • For 8-bit additions: 3 GHz × 8 additions = 24 Giga additions per second.
  • For 16-bit additions: 3 GHz × 4 additions = 12 Giga additions per second.
  • For 32-bit additions: 3 GHz × 2 additions = 6 Giga additions per second.
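These peak rates are just the clock frequency times the number of SIMD lanes, where the lane count is the register width divided by the element width. A quick sketch in plain C++ (the helper name is mine, not from the article):

```cpp
#include <cstdint>

// Peak additions per second = clock frequency * lanes per instruction,
// where lanes = register width / element width.
// (Illustrative helper, not code from the article.)
std::uint64_t peakAddsPerSecond(std::uint64_t clockHz,
                                unsigned registerBits,
                                unsigned elementBits)
{
    return clockHz * (registerBits / elementBits);
}
```

For a 3-GHz Pentium 4 with 64-bit MMX registers, peakAddsPerSecond(3000000000, 64, 8) reproduces the 24 Giga additions per second figure above.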

In theory, you can achieve a substantial peak speedup by using MMX additions instead of scalar additions:

  • 8-bit additions: 24 Giga additions versus 6 Giga scalar additions, for a 4× speedup.
  • 16-bit additions: 12 Giga additions versus 6 Giga scalar additions, for a 2× speedup.
  • 32-bit additions: 6 Giga additions versus 6 Giga scalar additions, for no speedup.

You’ll see this pattern repeatedly in the amount of acceleration that MMX/SIMD instructions bring: the larger the data size being manipulated, the smaller the potential acceleration. The reason is simple: the SIMD processing pipeline (unit) is fixed at 64 bits, while the scalar processing unit is 32 bits wide. Within those 64 SIMD bits, more small items, or fewer large items, can be processed in parallel. The practical conclusion is that for the greatest speedup, it is best to operate on the smallest possible items, for as long as possible.

A Little Bit of History

MMX was the first set of instructions Intel added to its Pentium and Pentium II CPUs. MMX instructions operated on 64 bits at a time—eight elements of 8 bits, four elements of 16 bits, two elements of 32 bits, or one element of 64 bits. All operations were on signed or unsigned integers.

With the Pentium III, Intel added more integer instructions, as well as instructions that operate on four elements of 32-bit floating-point data. Instructions for data prefetching and cache control were also added, along with eight new 128-bit registers for holding these quadruples of 32-bit floating-point values. Intel called these additions “Streaming SIMD Extensions” (SSE).

With the Pentium 4, Intel added a huge number of new instructions for integer and floating-point manipulation. Now the 128-bit registers can also be used to manipulate integers: 16 elements of 8 bits, eight elements of 16 bits, four elements of 32 bits, two elements of 64 bits, or one element of 128 bits. Intel also added more efficient memory read and write instructions. All of these additions went under the label SSE2.

Sadly, on the Pentium 4, most of the execution units that process MMX/SSE/SSE2 instructions remained 64 bits wide, probably for economic and power-usage reasons. This meant that the majority of the 128-bit SSE2 operations ended up running at half the throughput of 64-bit SSE operations.

Interestingly, the same is true of the SIMD floating-point instructions (Pentium III and Pentium 4): because the floating-point SIMD unit is only 64 bits wide, it can process only two 32-bit single-precision values per clock cycle, so processing four 32-bit floating-point values takes two cycles. With throughput at half the possible peak of one instruction per clock cycle, most of the 128-bit instructions effectively run at half the clock rate.

By Example

To illustrate, say you want to initialize a block of contiguous memory to a certain value. In normal (scalar) C++ code, you would write Listing Two(a), which writes every byte of the memoryBuffer array, one byte at a time, with whatever value you specify.

With MMX instructions, you would accomplish the same task as in Listing Two(b), which first initializes a 64-bit value, then uses it to initialize the memoryBuffer array 64 bits at a time. In theory, this MMX implementation should be eight times faster than the scalar implementation. Another way to accomplish the same task is to use the standard memset() function, as in Listing Two(c).
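If you don’t have the Intel ivec classes at hand, the idea behind Listing Two(b) can be sketched in portable standard C++, with a plain uint64_t standing in for the 64-bit MMX register (a sketch of mine, not the article’s code; it assumes length is a multiple of eight):

```cpp
#include <cstdint>
#include <cstring>

// Replicate the byte value into a 64-bit word, then store the buffer
// eight bytes at a time. An 8-byte memcpy compiles down to a single
// 64-bit store on typical compilers.
void initMemoryBy64Bit(unsigned char* memoryBuffer,
                       unsigned long length,
                       unsigned char value)
{
    std::uint64_t initValue = 0;
    for (unsigned i = 0; i < sizeof(initValue); i++)
        initValue = (initValue << 8) | value;      // eight copies of value
    for (unsigned long i = 0; i < length / sizeof(initValue); i++)
        std::memcpy(memoryBuffer + i * sizeof(initValue),
                    &initValue, sizeof(initValue));
}
```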

Benchmarking these three implementations on a 10,000,000-byte memoryBuffer array (on a 2.8-GHz Pentium 4 with a 533-MHz front-side bus and 800-MHz Rambus system memory), you get:

  • initMemoryByByte() does 1.2 GB/sec.
  • initMemoryByMMX() does 1.2 GB/sec.
  • initMemoryByMemset() does 1.2 GB/sec.

Hmmm…this doesn’t seem right. Where is the 8× speedup? It seems that you aren’t getting any speedup at all from the MMX instructions; all implementations run at the same speed. Maybe memory initialization does not benefit from MMX instructions at all? How about trying to improve the scalar code by storing more than just a byte at a time, as in Listing Three(a)? Benchmarking these functions, you get:

  • initMemoryByShort() does 1.2 GB/sec.
  • initMemoryByLong() does 1.2 GB/sec.

But you’re still not getting anywhere. Okay, now try the Pentium 4 SSE2 instructions in Listing Three(b), which let you write 128 bits at a time to memory. Benchmarking the initMemoryBySSE2() function gives us 1.2 GB per second.

Is there anything else that can be done? Since you are initializing such a large array (10,000,000 elements), it does not fit into the 512-KB processor cache. Plus, when you initialize memoryBuffer, you touch each element only once and never touch it again. The Pentium III and Pentium 4 have the capability of writing data around the cache. Try using this facility, as in Listing Three(c), and see what happens. Benchmarking initMemoryByMMXaroundCache() gives 2.6 GB per second.

Finally, improved performance! This implementation gives a 2.6/1.2 ≈ 2.2× speedup. But what’s going on here? In all the cases except the last one, the data was first written into the cache and then moved from the cache out to system memory, a two-step process. In the last implementation, the data went directly to system memory and was never placed into the cache. This is the main reason for the speedup. Can you go even faster? Try using SSE2 instructions to write around the cache, as in Listing Three(d). Benchmarking initMemoryBySSE2aroundCache() gives 2.6 GB per second.

This 128-bit implementation runs at the same speed as the 64-bit implementation. Could you have reached some other limit? The processor front-side bus runs at 533 MHz and is 64 bits (8 bytes) wide, so the peak bandwidth that can be transferred across it is 533 × 8 ≈ 4.2 GB per second. Also, the system memory in this machine is 800-MHz Rambus, which is 32 bits (4 bytes) wide, so its peak bandwidth is 800 × 4 = 3.2 GB per second. This must be the limit you are hitting, and it is most likely the best that you can do. Getting 2.6/3.2 × 100% = 81% of peak possible performance is pretty incredible.

Is it possible to do even better? Since you have reached the limit of the current machine, the only way to do better is to change machines. Intel has since increased the speed of the front-side bus to 800 MHz, which gives a peak bandwidth of 800 × 8 = 6.4 GB per second. System memory bandwidth has also been improved to 400-MHz dual-channel DDR (two channels, each 64 bits wide), which provides 400 × 16 = 6.4 GB per second. I benchmarked the same code on one of these newer machines and measured 4.1 GB per second.
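The bus arithmetic in the last two paragraphs is just the effective transfer rate times the bus width in bytes. A sketch (the helper name is mine, not from the article):

```cpp
// Peak bus bandwidth in MB/s = effective transfer rate in MHz (i.e.,
// million transfers per second) * bus width in bytes.
// (Illustrative helper, not code from the article.)
unsigned peakBandwidthMB(unsigned megaTransfersPerSec, unsigned busBytes)
{
    return megaTransfersPerSec * busBytes;
}
```

peakBandwidthMB(533, 8) gives 4264 MB/s, the ~4.2-GB/s front-side-bus figure; peakBandwidthMB(800, 4) gives the 3.2-GB/s Rambus figure; and peakBandwidthMB(400, 16) gives the 6.4-GB/s dual-channel DDR figure.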

Writing to Cache

To this point, I’ve only examined the bandwidth limits of writing to system memory. I did this by writing a very large memory buffer, which does not fit into the processor cache. Now I explore the bandwidth limits of the processor cache by writing a smaller memory buffer, which fits into the processor cache. To do this, I benchmarked all of the implementations once again, but used a memory buffer of 100,000 bytes, which fits easily into the processor cache. The results of the benchmarks are:

  • initMemoryByByte() does 1.5 GB/sec.
  • initMemoryByShort() does 2.8 GB/sec.
  • initMemoryByLong() does 5.6 GB/sec.
  • initMemoryByMemset() does 12 GB/sec.
  • initMemoryByMMX() does 9.6 GB/sec.
  • initMemoryByMMXaroundCache() does 2.6 GB/sec.
  • initMemoryBySSE2() does 12 GB/sec.
  • initMemoryBySSE2aroundCache() does 2.6 GB/sec.

These performance results make sense: the processor must have a much faster connection between the execution units and the cache. As you write larger data items at a time, the bandwidth goes up by a similar factor; writing 16-bit items goes twice as fast as 8-bit items, and writing 32-bit items doubles the bandwidth once again. Writing 64 bits at a time nearly doubles the performance again, but writing 128 bits at a time hits a bandwidth limit of 12 GB per second.
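The article does not show its timing harness; a minimal sketch of how such GB/s numbers can be measured, using std::chrono (my code, with memset standing in for any of the fill functions above):

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Minimal bandwidth-measuring sketch: time repeated fills of a buffer
// and report bytes written per second.
double measureFillBandwidth(std::size_t bufferBytes, int repetitions)
{
    std::vector<unsigned char> buffer(bufferBytes);
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; r++)
        std::memset(buffer.data(), 0xAB, buffer.size());
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    // Read the buffer so the compiler cannot drop the fills entirely.
    volatile unsigned char sink = buffer[0];
    (void)sink;
    return (double)bufferBytes * repetitions / elapsed.count();
}
```

In a real benchmark you would run several iterations, discard the first (cold-cache) pass, and report the best or median result.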

Also amazing is that if you unroll the loops in these examples, performance does not increase. This means that the computational units, most likely working in parallel within each loop iteration, are much faster than the memory (L2 cache or system memory) on the Pentium 4.
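For concreteness, this is the kind of unrolling meant above, applied to a portable 64-bit fill (my example, not the article's code; it assumes length is a multiple of 32). Four independent stores per iteration cut the loop overhead by 4×, yet on the Pentium 4 this bought no extra bandwidth, because memory, not loop overhead, is the bottleneck:

```cpp
#include <cstdint>
#include <cstring>

// 4x-unrolled variant of the 64-bit fill: four 8-byte stores per loop
// iteration instead of one.
void initMemoryBy64BitUnrolled(unsigned char* memoryBuffer,
                               unsigned long length,
                               unsigned char value)
{
    std::uint64_t v = 0;
    for (unsigned i = 0; i < sizeof(v); i++)
        v = (v << 8) | value;                    // eight copies of value
    for (unsigned long i = 0; i + 32 <= length; i += 32) {
        std::memcpy(memoryBuffer + i,      &v, 8);
        std::memcpy(memoryBuffer + i + 8,  &v, 8);
        std::memcpy(memoryBuffer + i + 16, &v, 8);
        std::memcpy(memoryBuffer + i + 24, &v, 8);
    }
}
```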

Conclusion

In this article, I analyzed just a small portion of the Intel processor architecture: the bandwidth of writes to the processor cache (a peak of 12 GB/sec.) and the bandwidth of writes to system memory (a peak of 4.1 GB/sec.). Armed with this knowledge, you can write code that extracts the maximum available performance.

Listing 1

// (a) Scalar additions: one result per operation
unsigned char a = 0, b = 1, c;
c = a + b;

unsigned short a = 0, b = 1, c;
c = a + b;

unsigned long a = 0, b = 1, c;
c = a + b;

// (b) MMX additions using Intel's C++ ivec classes
Iu8vec8 a, b, c;   // Iu8vec8 holds 8 elements of unsigned char
c = a + b;

Iu16vec4 a, b, c;  // Iu16vec4 holds 4 elements of unsigned short
c = a + b;

Iu32vec2 a, b, c;  // Iu32vec2 holds 2 elements of unsigned long
c = a + b;

// Iu8vec8  is equivalent in layout to "unsigned char tmp[8]"
// Iu16vec4 is equivalent in layout to "unsigned short tmp[4]"
// Iu32vec2 is equivalent in layout to "unsigned long tmp[2]"

Listing 2

#include <string.h>  // memset()
#include <ivec.h>    // Intel C++ ivec classes (Iu8vec8), shipped with the Intel compiler

// (a) Scalar: one byte at a time
void initMemoryByByte( unsigned char * memoryBuffer, unsigned long length, unsigned char value )
{
   for( unsigned long i = 0; i < length; i++ )
      memoryBuffer[ i ] = value;
}
// (b) MMX: 64 bits at a time
void initMemoryByMMX( unsigned char * memoryBuffer, unsigned long length, unsigned char value )
{
   Iu8vec8 initValue;
   unsigned long i;
   for( i = 0; i < sizeof( Iu8vec8 ); i++ )
      initValue[ i ] = value;
   for( i = 0; i < ( length / sizeof( Iu8vec8 )); i++ )
      ((Iu8vec8 *)memoryBuffer )[ i ] = initValue;
   empty();  // clear MMX state so later floating-point code works correctly
}
// (c) Standard library memset()
void initMemoryByMemset( unsigned char * memoryBuffer, unsigned long length, unsigned char value )
{
   memset( memoryBuffer, value, length );
}

Listing 3

#include <ivec.h>       // Intel C++ 64-bit ivec classes (Iu8vec8, store_nta(), empty())
#include <dvec.h>       // Intel C++ 128-bit ivec classes (Iu8vec16)
#include <emmintrin.h>  // SSE2 intrinsics (_mm_store_si128, _mm_stream_si128)

// (a) Scalar: 16 and 32 bits at a time
void initMemoryByShort( unsigned char * memoryBuffer, unsigned long length, unsigned char value )
{
   unsigned short initValue = (((unsigned short) value ) << 0 ) |
                              (((unsigned short) value ) << 8 );
   for( unsigned long i = 0; i < ( length / sizeof( unsigned short )); i++ )
      ((unsigned short *)memoryBuffer )[ i ] = initValue;
}
void initMemoryByLong( unsigned char * memoryBuffer, unsigned long length, unsigned char value )
{
   unsigned long initValue = (((unsigned long) value ) << 0 ) |
                             (((unsigned long) value ) << 8 ) |
                             (((unsigned long) value ) << 16 ) |
                             (((unsigned long) value ) << 24 );
   for( unsigned long i = 0; i < ( length / sizeof( unsigned long )); i++ )
      ((unsigned long *)memoryBuffer )[ i ] = initValue;
}
// (b) SSE2: 128 bits at a time, through the cache
void initMemoryBySSE2( unsigned char * memoryBuffer, unsigned long length, unsigned char value )
{
   Iu8vec16 initValue;
   __m128i * memoryBuffer128bitPtr = (__m128i *) memoryBuffer;
   unsigned long i;
   for( i = 0; i < sizeof( Iu8vec16 ); i++ )
      initValue[ i ] = value;
   for( i = 0; i < ( length / sizeof( Iu8vec16 )); i++ )
      _mm_store_si128( memoryBuffer128bitPtr++, initValue );
}
// (c) MMX: 64 bits at a time, around the cache (non-temporal stores)
void initMemoryByMMXaroundCache( unsigned char * memoryBuffer, unsigned long length, unsigned char value )
{
   Iu8vec8 initValue;
   __m64 * memoryBuffer64bitPtr = (__m64 *) memoryBuffer;
   unsigned long i;
   for( i = 0; i < sizeof( Iu8vec8 ); i++ )
      initValue[ i ] = value;
   for( i = 0; i < ( length / sizeof( Iu8vec8 )); i++ )
      store_nta( memoryBuffer64bitPtr++, initValue );
   empty();  // clear MMX state
}
// (d) SSE2: 128 bits at a time, around the cache (non-temporal stores)
void initMemoryBySSE2aroundCache( unsigned char * memoryBuffer, unsigned long length, unsigned char value )
{
   Iu8vec16 initValue;
   __m128i * memoryBuffer128bitPtr = (__m128i *) memoryBuffer;
   unsigned long i;
   for( i = 0; i < sizeof( Iu8vec16 ); i++ )
      initValue[ i ] = value;
   for( i = 0; i < ( length / sizeof( Iu8vec16 )); i++ )
      _mm_stream_si128( memoryBuffer128bitPtr++, initValue );
}

 

 
