Performance Limiters

Every computer system, whether a laptop, a mobile device, a souped-up desktop gaming system, or a distributed system in the cloud or on-premises, has many factors that limit its performance: memory bandwidth, cache bandwidth, number of cores, clock frequency of each core, network bandwidth, availability and capability of parallel instructions, graphics processor computational capability, disk input/output bandwidth and transactions per second, random access time of system memory, and so on.

In this post, I’ll explore these kinds of system performance limiters, and consider how code and architectures can be modified to reduce these limiting effects.

Memory Bandwidth

Modern Intel architecture processors (CPUs) support a direct connection to system memory made of Dynamic Random Access Memory (DRAM). The latest generation of DRAM is DDR4, which provides a significant bandwidth boost over DDR3. Intel CPUs support single, dual, or quad channels of memory connected directly to the processor/CPU. Measuring peak system memory bandwidth is fairly straightforward.
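
As a rough sanity check, the theoretical peak can be estimated as channels × bytes per transfer × transfers per second. The sketch below assumes dual-channel DDR4-2133 (the exact DRAM speed is an assumption on my part); the sequential-read bandwidth measured below comes in somewhat under this peak.

    #include <cstdio>

    int main()
    {
        // Assumed configuration: dual-channel DDR4-2133 (an assumption, not a measured fact).
        const double channels         = 2;        // populated memory channels
        const double bytesPerTransfer = 8;        // each DDR4 channel is 64 bits wide
        const double transfersPerSec  = 2133e6;   // DDR4-2133 -> 2133 mega-transfers/second

        const double peak = channels * bytesPerTransfer * transfersPerSec;
        printf("Theoretical peak DRAM bandwidth: %.1f GBytes/sec\n", peak / 1e9);  // ~34.1
        return 0;
    }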

On my laptop, which has dual-channel DDR4 system memory, sequential reads of different-size arrays running on 4 cores (not hyperthreaded) measured the bandwidths shown in the following table and in the graph at the top of this blog:

Array Size (KBytes) | Bandwidth (GBytes/sec) | Memory Type
102,400             | 24                     | System
2,048               | 52                     | L3 Cache
128                 | 159                    | L2 Cache
4                   | 263                    | L1 Cache

For these measurements, Intel’s Memory Latency and Bandwidth Checker application was used (see the link below). This application uses SSE/SIMD instructions to measure the peak available bandwidth.

When an algorithm accesses memory, bandwidth limits come into play and cap performance, depending on how large the arrays being accessed are, which cache level they fit into, and whether the memory accesses are sequential. For sequential reads, keeping memory accesses within the lower levels of cache instead of going to system memory leads to 2X to 10X faster memory accesses.
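
As an illustration of this effect (a plain C++ sketch, not the Memory Latency and Bandwidth Checker itself; the array sizes and the 1 GByte total read volume are arbitrary choices for the example), the loop below times repeated sequential reads of arrays sized to fit in each cache level and in system memory:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Time repeated sequential reads of an array and report GBytes/sec.
    static double readBandwidthGBs(size_t sizeBytes, int repeats)
    {
        std::vector<int> a(sizeBytes / sizeof(int), 1);
        volatile long long sink = 0;                 // keeps the reads from being optimized away
        auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r) {
            long long sum = 0;
            for (size_t i = 0; i < a.size(); ++i)
                sum += a[i];                         // sequential reads
            sink += sum;
        }
        std::chrono::duration<double> sec = std::chrono::steady_clock::now() - start;
        return (double)sizeBytes * repeats / sec.count() / 1e9;
    }

    int main()
    {
        // 4 KBytes (L1), 128 KBytes (L2), 2 MBytes (L3), 100 MBytes (system memory)
        const size_t sizes[] = { 4u << 10, 128u << 10, 2u << 20, 100u << 20 };
        for (size_t s : sizes) {
            int repeats = (int)((1ull << 30) / s);   // read about 1 GByte in total per size
            printf("%9zu bytes: %.1f GBytes/sec\n", s, readBandwidthGBs(s, repeats));
        }
        return 0;
    }

Because this is scalar C++ (unless the compiler happens to auto-vectorize the inner loop), the absolute numbers it reports will be lower than the SSE-based measurements above, but the drop from L1 cache to system memory is still clearly visible.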

The following table shows sequential read bandwidth from a single core:

Array Size (KBytes) | Bandwidth (GBytes/sec) | Memory Type
102,400             | 18                     | System
2,048               | 36                     | L3 Cache
128                 | 49                     | L2 Cache
4                   | 77                     | L1 Cache

These measurements clearly show that the L1 and L2 caches are not shared between the four cores of the CPU, since single-core bandwidth scales by nearly 4X when going to four cores for the L1 and L2 caches. However, for the L3 cache and system memory, bandwidth barely increases when going from accesses by one core to accesses by four cores.

Overhead of OS Paging In Arrays

When allocating a new array, the allocation process requests memory for that array, but the OS pages are not created for that array until each page has been accessed. At that point the OS makes space for each page in system memory, paging out other applications if necessary to make room for this array. This process takes time.

For example, the following table compares the time it takes to set all elements of an array of 1 billion 32-bit integers to a value, versus setting the values of an existing array that has already been brought into system memory by the OS:

Array Allocation      | Array Size (32-bit Elements) | Time (seconds) | Initialize Performance (GiBytes/sec)
Newly Allocated Array | 1,073,741,824                | 1.4            | 3.1
Existing Array        | 1,073,741,824                | 0.18           | 24

The time and performance of these two cases are dramatically different: setting an existing array that has already been brought in by the OS is over 7X faster.
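
A minimal sketch of this comparison (the timing approach here is my own, not the exact benchmark behind the table; it needs a 64-bit build and roughly 4 GiBytes of free memory):

    #include <chrono>
    #include <cstdio>

    // Set every element of the array and return the elapsed time in seconds.
    static double fillSeconds(int* a, size_t n, int value)
    {
        auto start = std::chrono::steady_clock::now();
        for (size_t i = 0; i < n; ++i)
            a[i] = value;
        std::chrono::duration<double> sec = std::chrono::steady_clock::now() - start;
        return sec.count();
    }

    int main()
    {
        const size_t n = 1073741824ull;       // 1,073,741,824 32-bit integers (4 GiBytes)
        int* a = new int[n];                  // allocated, but OS pages not yet created

        double first  = fillSeconds(a, n, 0); // first touch: the OS pages in the array
        double second = fillSeconds(a, n, 1); // array already resident in system memory

        printf("Newly allocated array: %.2f seconds\n", first);
        printf("Existing array:        %.2f seconds\n", second);
        delete[] a;
        return 0;
    }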

Fast Paging In

One way to page a new array into system memory is to initialize (or set) each element of the array to a value. For example, if it’s an array of 32-bit integers, set each element to zero. This is too much work if we don’t need each array element to be zero and all we want is to bring the array into memory. The minimal work is to access a single byte from each OS page of the array. On Windows and Linux, for Intel processors, OS pages are 4 KBytes. The table below compares the performance of each page-in method:

Type of Access         | Array Size (32-bit Elements) | Time (seconds) | Page In Performance (GiBytes/sec)
Read 1 byte per page   | 1,073,741,824                | 1.03           | 4.2
Set all array elements | 1,073,741,824                | 1.4            | 3.1

It is slightly faster to access a single byte from each OS page. It’s surprising how slow the Windows OS is at bringing pages into system memory, running at 4 GiBytes/sec out of 25 GiBytes/sec of available system memory bandwidth.
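
A sketch of the one-byte-per-page approach (assuming the 4 KByte page size mentioned above):

    #include <cstddef>

    // Touch one byte in every 4 KByte OS page so the OS brings the whole
    // allocation into system memory, without writing every element.
    static void pageInByTouch(const void* p, size_t sizeBytes)
    {
        const size_t pageSize = 4096;                    // 4 KByte pages on Windows and Linux (x86)
        const volatile char* bytes = static_cast<const volatile char*>(p);
        for (size_t offset = 0; offset < sizeBytes; offset += pageSize)
            (void)bytes[offset];                         // a single read faults the page in
    }

    // Example use: page in a newly allocated array up front, so later passes
    // over it run at full memory bandwidth.
    //   int* a = new int[1073741824ull];
    //   pageInByTouch(a, 1073741824ull * sizeof(int));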

Performance Limits Using Plain C++

The above performance was obtained using SSE/SIMD data-parallel instructions, which are capable of operating on up to 256 bits at a time on the i5-6300HQ CPU. However, most C++ compilers do a poor job of turning C++ code into SSE/SIMD instructions, except for the simplest of cases, leaving embedded assembly or intrinsics as the best option for developers. Let’s explore performance limits when using plain C++.
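
As a taste of the intrinsics route (a sketch assuming AVX2, enabled with a compiler flag such as -mavx2 or /arch:AVX2; this is not the code behind the measurements above), here is a sequential read loop that loads 256 bits at a time:

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Sum an array of 32-bit integers, loading 256 bits (8 ints) per iteration.
    // Note: the 32-bit lane accumulators can overflow for large sums; that is
    // acceptable for a bandwidth test where the values are small.
    static int64_t sumAvx2(const int32_t* a, size_t n)
    {
        __m256i acc = _mm256_setzero_si256();
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i*)(a + i)));

        // Reduce the 8 partial sums, then handle the scalar tail.
        alignas(32) int32_t partial[8];
        _mm256_store_si256((__m256i*)partial, acc);
        int64_t sum = 0;
        for (int k = 0; k < 8; ++k) sum += partial[k];
        for (; i < n; ++i) sum += a[i];
        return sum;
    }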

Useful Links

Memory Latency and Bandwidth Checker
