Computing an average of two numbers seems like a simple and innocent operation. Here is an example from a C++ computer science textbook:
int m = (a + b) / 2;
where a and b are also ints, which are signed 32-bit values on most computers. This code seems innocent enough: add two integers, divide by two, and boom, we have computed an average. It works in math, so why shouldn’t it work in code? Well…
The above code works great for small values of a and b. However, once the sum of a and b exceeds the upper limit of a 32-bit signed integer, things start to go wrong. For example,
a = 2; b = 5; (a+b) is 7
a = 2; b = 2147483647; (a+b) is -2147483647
How is this possible? What kind of goofy math is this?
Finite precision of integers (signed 32-bit integers, in this case) is the reason for this issue. The value 2147483647 happens to be INT_MAX, the largest value the C/C++ int data type can hold. When we add 2 to INT_MAX, we get an overflow condition, where the result of the addition wraps around to negative values. For overflow to happen, the sum merely needs to exceed INT_MAX. Both a and b can be quite a bit smaller than INT_MAX, and both can be positive, but if their sum is bigger than INT_MAX, then the result will be negative.
Havoc Caused by Computing Average Wrong
The following article is a great, short read showing how such a simple calculation as the integer average of two numbers can leave bugs lying dormant in commercial software for years:
Joshua Bloch, “Extra, Extra – Read All About it: Nearly All Binary Searches and Mergesorts are Broken”, June 2, 2006
There are many ways to fix this problem:
// slightly lower precision, if we care
int average = a / 2 + b / 2;
// full precision
int average = (a / 2) + (b / 2) + ((a % 2 + b % 2) / 2);
// full precision, fewer operations
int average = a + (b - a) / 2;
// sum as signed 64-bit, down-convert the result
int average = ((long long)a + b) / 2;
- The first solution is very simple, but it loses a little precision: when both values are odd, it can be off by 1 from the original code we started with (at the top of this blog).
- The second solution is a clever one, where full precision is restored by adding in the remainders.
- The third solution uses fewer operations, yet produces a result equivalent to the original code for positive numbers. However, care is needed to use it correctly with unsigned numbers: to avoid underflow in (b - a), b must not be smaller than a. For signed numbers the situation is worse, with a solution below.
- The fourth solution converts both a and b to signed 64-bit integers, sums them, divides by 2, and then down-converts the result to a signed 32-bit integer.
The third solution works well in two specific cases:
- both a_i and b_i are positive, such as array indexes (as long as indexes don’t go negative, which some implementations allow)
- both a_i and b_i are negative
It can cause underflow or overflow when a_i and b_i are of opposite signs, since the difference (b_i - a_i) may then exceed the integer range.
A second, more general solution is:
if ((a_i >= 0 && b_i >= 0) || (a_i < 0 && b_i < 0))
    average_ab = a_i + (b_i - a_i) / 2;
else
    average_ab = (a_i + b_i) / 2;
The above solution works for any signed integer values of a_i and b_i: when the signs are the same, the difference (b_i - a_i) cannot overflow, and when the signs differ, the sum (a_i + b_i) cannot overflow.
These are all great solutions for the average of two signed 32-bit numbers, but what about computing an average of more than two values? What about other numeric data types, such as unsigned 32-bit integers, signed and unsigned 64-bit integers (such as C++ size_t, which is unsigned), 16-bit integers (signed and unsigned), or floating-point numbers? Let’s take each of these cases, one at a time.
One interesting property of an average is that it always lies between the smallest and the largest value being averaged. This implies that the result (the average) will always be within the range of the data type being used and will not overflow or underflow.
Several of the above solutions extend well to arrays, and some don’t.
- The first solution does not extend well to large arrays: as the size of the array grows, each element is divided by a larger value, losing substantial precision.
- The second solution extends well to large arrays, as it recaptures the lost precision via the modulus portion of the calculation.
- We will need to revisit solution three at some future point, as there may be something interesting there to explore further. Possibly find the minimum of all the numbers, use it as “a” in the equation, and divide by the number of elements in the array.
- The fourth solution is a wonderfully simple scheme: perform the sum using a larger-precision type, which holds more bits, so the sum cannot overflow; then perform a single division to compute the average. This is the main algorithm/solution we will explore further, in C++ and in C#.
The fourth solution led to lots of code for safe summations in the HPCsharp C# NuGet package, which is open source and free. It implements safe summation for various data types, and computing an average is also supported.