Faster Merge in C# and C++

Merging of two pre-sorted arrays into a single array is a core part of the merge sort algorithm. In this blog I’ll discuss a couple of high performance implementations for a serial merge, and a way to reduce the number of comparisons.

static public void Merge(int[] a,   Int32 aStart, Int32 aEnd,
                         int[] b,   Int32 bStart, Int32 bEnd,
                         int[] dst, Int32 dstStart)
{
    while (aStart <= aEnd && bStart <= bEnd)
      dst[dstStart++] = (a[aStart] <= b[bStart]) ? a[aStart++] : b[bStart++];
    while (aStart <= aEnd) dst[dstStart++] = a[aStart++];
    while (bStart <= bEnd) dst[dstStart++] = b[bStart++];
}

The above C# compact implementation is simple and efficient, with nearly the same performance as an equivalent version in C++. The main while loop has two comparisons to determine when to terminate the loop, plus one more compare to compare the smallest value from the two pre-sorted arrays.

It is possible to reduce the number of loop termination comparisons from two to one, as shown in the following code:

static public void Merge(int[] a,   Int32 aStart, Int32 aEnd,
                         int[] b,   Int32 bStart, Int32 bEnd,
                         int[] dst, Int32 dstStart)
{
    if (aStart <= aEnd && bStart <= bEnd)
    {
        while (true)
            if (src[aStart] <= src[bStart])
            {
                  dst[dstStart++] = src[aStart++]; 
                  if (aStart > aEnd) break;
            }
            else
            {
                dst[dstStart++] = src[bStart++];
                if (bStart > bEnd) break;
            }
    }
    while (aStart <= aEnd) dst[dstStart++] = src[aStart++];
    while (bStart <= bEnd) dst[dstStart++] = src[bStart++];
}

The above C# less compact implementation is slightly more complex, but is 2% to 11% faster. It reduces the number of comparisons performed in the while loop. Instead of always performing “while (aStart <= aEnd && bStart <= bEnd)” which contains two comparisons, it performs either “if (aStart > aEnd) break;” or “if (bStart > bEnd) break;” as the loop control, bringing the number of comparisons down to one.

In other words, the loop control is conditional. If the smallest value comes from the a array then aStart is compared to aEnd, as there is no need to compare the boundaries of the b array, in that case. Otherwise, the b array boundaries are checked. In either case, only the boundary of a single array is checked.

It’s very simple to extend the above implementation to be generic, to merge any type supporting comparison.

This presents an interesting pattern for looping, which uses different loop termination checking under different conditions within the loop, the benefit of which is higher performance. Wouldn’t it be cool if compilers did this transformation automagically?

Merge in C++

The same implementations for the two serial Merge variations, in C++, are shown below as generic functions. In case of C++, the performance gain from the second version is more dramatic. Performance of C++ is higher than C# for serial Merge, most likely since C# performs extra work to check array bounds.

template< class _Type >
inline void merge_ptr( const _Type* a_start, const _Type* a_end,
                       const _Type* b_start, const _Type* b_end,
                       _Type* dst )
{
	while( a_start < a_end && b_start < b_end )
		*dst++ = (*a_start <= *b_start) ? *a_start++ : *b_start++;
	while( a_start < a_end )	*dst++ = *a_start++;
	while( b_start < b_end )	*dst++ = *b_start++;
}

The same improvement to the merge implementation is implemented for C++ below.

template< class _Type >
inline void merge_ptr_1( const _Type* a_start, const _Type* a_end,
                         const _Type* b_start, const _Type* b_end,
                         _Type* dst )
{
	if (a_start < a_end && b_start < b_end) {
		while (true) {
			if (*a_start <= *b_start) {
  				*dst++ = *a_start++;
  				if (a_start >= a_end)	break;
			}
			else {
				*dst++ = *b_start++;
				if (b_start >= b_end)	break;
			}
		}
	}
	while( a_start < a_end )	*dst++ = *a_start++;
	while( b_start < b_end )	*dst++ = *b_start++;
}

The above C++ implementation is about 13% faster than the simpler implementation. Interestingly, this same clever improvement has recently been added in the VisualStudio 2017 STL implementation. So, you don’t have to code it yourself, as it’s already included! Of course, if you want to extend this clever method to merge more than two pre-sorted arrays, then you’re on your own, for now…

The above C++ algorithm merges two pre-sorted integer arrays producing about 250 MegaIntegers per second on my quad-core i5 6300HQ laptop CPU. Parallel merge in C++, provided in GitHub provides merged result at about 575 MegaIntegers/second. This parallel implementation is described in the following Blog.

More Detailed Performance Analysis

Another way to look at these performance numbers is that two pre-sorted integer arrays of 50 Million integers are merged at 0.2 seconds to produce a 100 Million sorted resulting array, using the serial merge algorithm. This operation reads 100 Million integers and writes 100 Million integers. Each integer is 4 bytes, which results in 400 MBytes/0.2 seconds of memory reads and 400 MBytes/0.2 seconds of memory writes, for the serial merge algorithm. Total bandwidth is then 2 GBytes/sec for system memory reads and 2 GBytes/sec of memory writes.

Parallel Merge algorithms merges two pre-sorted 50 Million integers arrays in 85 milliseconds. 400 MBytes/0.085 seconds for reads and 400 MBytes/0.085 seconds for writes. Total bandwidth is then 4.7 GBytes/sec for memory reads and 4.7 GBytes/sec for memory writes.

To set memory to a value using 64-bit writes at a time using C++ runs at 0.35 seconds to set 100 Million elements. Using Intel Performance Primitive (IPP) library, which uses SIMD/SSE instructions, takes 0.28 seconds. These operations perform only system memory writes, at the rate of 2.3 GBytes/sec and 2.9 GBytes/sec. This is really puzzling, as it is slower than the merge, which implies that a single core can’t use all of the system memory bandwidth even when doing nothing but writing 64-bit at a time.

I’ll experiment writing 128, 256, and 512-bits at a time to see if memory bandwidth can be maxed out using a single core.

Additional Resources

Parallel Algorithms repo and HPCshap C# Nuget open-source package, which have these implementations along with parallel ones, now have a companion electronic book describing several parallel algorithms: Practical Parallel Algorithms in C++ and C#

Algorithm Performance

Measure, Question, Improve, Do It Again…

Faster Merge in C# and C++

Merge in C++

More Detailed Performance Analysis

Additional Resources

One thought on “Faster Merge in C# and C++”

Leave a comment Cancel reply

Merge in C++

More Detailed Performance Analysis

Additional Resources

Share this:

Related

One thought on “Faster Merge in C# and C++”

Leave a comment Cancel reply