Computing an average of two numbers seems like a simple and innocent operation. Here is an example from an Algorithms textbook:
int m = (a + b) / 2;
where a and b are also ints, which are signed 32-bit values on most computers. This code seems innocent enough: add two integers, divide by two, and boom, we have computed an average value. It works in math, so why shouldn’t it work in code? Well…
The above code works great for small values of a and b. However, as the sum of a and b reaches the upper limit for a 32-bit integer, things start to go wrong. For example,
a = 2; b = 5; (a+b) is 7
a = 2; b = 2147483647; (a+b) is -2147483647
How is this possible? What kind of goofy math is this?
Finite precision of integers (32-bit signed integers, in this case) is the reason for this issue. The value 2147483647 happens to be INT_MAX, the largest value the C data type int can hold. When we add 2 to INT_MAX, we get an overflow condition. Strictly speaking, signed overflow is undefined behavior in C and C++, but on typical two’s-complement hardware the result of the addition wraps around to a negative value. For overflow to happen, the sum only needs to exceed INT_MAX. Both a and b can be quite a bit smaller than INT_MAX, and both can be positive, but if their sum ends up bigger than INT_MAX, the result will be negative.
There are many ways to fix this problem:
c = a / 2 + b / 2; // slightly lower precision, if we care
c = (a / 2) + (b / 2) + ((a % 2 + b % 2) / 2); // full precision
c = a + (b - a) / 2; // full precision, fewer operations; assumes a <= b, and b - a must not overflow
c = ((long long)a + b) / 2; // sum signed 64-bit, and down convert result
- The first solution is very simple, but it loses a little precision: it can be off by 1 from the original code that we started with (at the top of this blog) when both values are odd.
- The second solution is a clever one, where full precision is restored by adding up the remainders.
- The third solution is an interesting one, as it uses fewer operations, yet produces a result equivalent to the original code as long as a <= b (with truncating division, the order of the operands matters).
- The fourth solution converts both a and b to signed 64-bit integers, sums them, divides by 2, then converts the result back down to a signed 32-bit integer.
These are all great solutions for an average of two signed 32-bit numbers, but what about computing an average of more than two values? What about other numeric data types, such as unsigned 32-bit integers, signed 64-bit integers, or 16-bit integers (signed and unsigned), or floating-point numbers? Let’s take each of these cases, one at a time.
Several of the above solutions extend well to arrays, and some do not.
- The first solution does not extend well to large arrays: as the size of the array grows, each element must be divided by an ever-larger value, losing substantial precision.
- The second solution extends well to large arrays, as the modulus portion of the calculation recaptures the lost precision.
- We will need to revisit solution three at some future point, as there may be something interesting there to explore further.
- The fourth solution is a wonderfully simple scheme: perform the sum using a wider type, whose extra bits avoid overflow, then perform a single division to compute the average. This is the main algorithm/solution we will explore further, in C++ and in C#.