Rump's Royal Pain

Here's something cool I found. I'll explain later. For now, I simply want you to solve the following equation when x = 77617 and y = 33096.

333.75y6 + x2(11x2y2 − y6 − 121y4 − 2) + 5.5y8 + (x ÷ 2y)

What did you get?

I bet you used a computer, didn't you... Well then your answer is almost certainly wrong. For example, WolframAlpha thinks the answer is: 1.18059 × 1021. Yeah, not even close. It's off by roughly how many grains of sand exist on all the beaches in the world. I mean, it didn't even get the sign right! The correct answer is roughly −0.827396. Don't trust me? I've shown my work in Appendix 1.

For now though, assuming you'll trust me, let's look at why this happens and in fact why computers are so bad at computing things like this. We can then look at possible solutions to this problem.


IEEE floating point strikes again...

I assume you've heard that floating point numbers have a precision problem. If not, Tom Scott's got a good video for the uninitiated. Quick refresher, if you try and represent something like $5.26 using IEEE floating point, you're going to have a bad day.


Sure, they've got some rounding errors, but how on earth can it go so horribly wrong in just a dozen or so operations?

Well, real numbers have a very large infinity of values. This is a finding many people learn early as they discover the possibilities in infinities. To recap, there are both infinitely many positive and negative integer numbers (..., −2, −1, 0, 1, 2, ...), but there are also infinitely many real numbers between just 0 and 1 (0, 0.5, 0.75, 0.875, 0.9325, ..., 1) and this is true between any pair of integers in the reals. Succinctly, some infinities (all the real numbers) are bigger than others (all the integer numbers).

When it comes to computers though, you're not getting infinite hardware. Especially for a primitive like a numeric type, you're only going to get so many bits. A typical desktop ISA like x86-64 uses the IEEE 754 float64 standard (also known as a double). This standard deals with the limitation of 64 bits by converting the number to binary scientific notation and then allocating:

That means you can only represent numbers up to 253 accurately. It's 53 and not 52 because we don't store the lead 1 of the mantissa since it's implied, and thus we gain a bit for free. For example, 1.0101001 × 21101 stores the mantissa as 0101001.

Let's calculate the largest value we can store without rounding against the largest term in our equation:

= 253
= 9007199254740992


= 776172(11 × 776172 × 330962 − 330966 − 121 × 330964 − 2)
= −7917111779274712207494296632228773890

Yeah, seven undecillion is much bigger than nine quadrillion so there may be a bit of rounding. You can actually see this in action if you try printing these numbers with Python. At the 16th digit things start to get messy.

print(f" {9007199254740991.0:.50}")
print(f" {9007199254740992.0:.50}")
print(f" {9007199254740993.0:.50}")

Once your number is past the information limit for a given amount of hardware, you've got to start rounding. You can calculate how much we have to round given a certain number of bits using the equation:

±2⌈log2(|num| + 1)⌉ ÷ 2 ÷ 2bits

In our case, num = −7917111779274712207494296632228773890, and bits = 52, which when plugged into the above equation equals ±1.18059 × 1021. Suspiciously, that's exactly what the computer thinks the answer is. I guess that means the answer is legitimately within the allowed rounding error. Hurrah, it's technically correct. I don't know about you, but I don't really like that answer, and it turns out I'm not alone.

What Now?

At this point you at least know why your computer is lying. It's not, there's just an unspoken contract involving numerical precision based on the scale of the numbers involved. You're more than welcome to compute the precision yourself but the hardware's not going to do it for you. You've also got to account for any error introduced as the result of chained operations. But in theory, this is all calculable.

Then why does it feel like a lie? Well, I'd argue it's because there's an implicit speed versus accuracy trade-off we're making. To elaborate let's briefly look at how integers handle this exact same problem.

Integer Overflow

Integers have the exact same finite hardware problem. You can only represent numbers so big in an integer of a fixed size, but integers chose to be explicit in their overflow. When you overflow an integer (analogous to exceeding the precision limit of a float), you get a status flag set by the CPU which you can check and act on. You get extremely fast computations with a slower precision guard when you need it.

I actually think it's that last part that sticks out in the solution. Sometimes speed matters, sometimes accuracy matters. Having the choice is powerful. Programming language designers have various opinions about what makes the most sense for their users. Some languages leave the choice to you, some always check for overflows and raise a runtime exception for you to handle, and others just automatically bundle a bignum library to make sure you get the correct answer at the cost of a much slower calculation.

Python for example uses that last method, automatically using bignum arithmetic. The example earlier printing the number as in integer didn't round the number like it did when we appended the .0 thanks to built in bignum support. It does this without the programmer having to worry about it. It just works. It's much slower than a regular integer, but how often do I really use such huge numbers. I'm not saying this is always the best answer, just noting it's a common one.

About that overflow flag. Can I have hardware fast computations that don't occasionally give me incorrect answers within some unspecified range of precision?

Interval Arithmetic

Sadly, a hardware flag letting you know it rounded the answer isn't really helpful as most floating point numbers get rounded. What about rounded to more than your allowed precision? Well, you'd have to constantly specify what precision you accept for every operation and it still wouldn't capture the error that accumulates as you manipulate them. It's not off the table, but I don't know of anyone pursuing this. There's another way though.

There's a field of research in computational mathematics called interval arithmetic. When you can't really sacrifice speed or accuracy, what if you're fine with answers that don't have a value but instead specify a range of possible values they could be? I mean, that's exactly what traditional floats are doing anyway, you've just been assuming the applied rounding to the interval was close enough to not matter.

For example, what if 0.1 + 0.2 = (0.29999542236328125, 0.3000030517578125)? That is, the answer is somewhere in the interval. It's definitely not as nice as saying the answer is 0.3, but your computer can't represent that using a base 2 scientific notation number anyway. It's always been rounding the value 0.29999542236328125 ± ~7.62939453125 × 10 −6, but now it's spelled out for you. You get to decide if that's close enough to the answer you wanted it to be. You can even apply the rounding yourself if you want. You just get to do so explicitly when the situation calls for it.

In 2015 a numerical type called a universal number (unum for short) was published by John Gustafson as a way to computationally represent exactly this concept. I'm not going to say something over the top like, "Everyone should stop using floats and use unums instead." That's absurd. They are fascinating though and worth checking out if you think your problem warrants this sort of guarded precision at the cost of more complex semantics and a slower runtime. I say slower runtime because they're only available in software libraries or accelerator cards priced at contact us™ (if you have to ask you can't afford it).

If you're curious about unums, there's a talk by Ferris Ellis about them at Strange Loop covering a bunch of the basics. It's honestly worth a watch even if you're never going to use them. The second half is a bit under informed and over extrapolated but while he's explaining the basics of unums it's a fairly useful overview.

Other Alternatives

While intervals are very interesting, they're also arguably a lot more work to understand and use. Some applications may be best expressed using them, but if you're using a library anyway and accuracy matters more than speed, there are bigfloat libraries. For example, gmpy2 is a Python library that provides bindings to libmpfr. It's easy enough to use and results in generally familiar code:

import gmpy2

with gmpy2.local_context() as ctx:
	ctx.precision = 125

	x = gmpy2.mpfr(77617)
	y = gmpy2.mpfr(33096)

	x2 = x * x
	y2 = y * y
	y4 = y2 * y2
	y6 = y4 * y2
	y8 = y4 * y4

	ans = (
		333.75 * y6
		+ x2 * (11 * x2 * y2 - y6 - 121 * y4 - 2)
		+ 5.5 * y8
		+ (x / (2 * y))


The only interesting thing is the need to define the precision you want in a local context. You can globally set the precision but that's probably not a good pattern given the precision heavily impacts the performance of the operations, though performance may not be as bad as you might expect. For example, the gmpy2 code above is only about 4 times slower than the floating point version of the same code.

Rump's Example

While I'd like to take credit for finding the example I lead with, It's actually a fairly old example in the literature of floating point and interval research. It comes from Siegfried Rump in his paper, Algorithms for Verified Inclusions: Theory and Practice. Published in 1988, it's kind of dated at this point. It's noteworthy for the time because the IBM System/370 yields an incorrect answer that's stable even with more and more precision. Running the same calculation with different precisions is sometimes enough to determine if you've got a precision problem in your calculation. In this paper though Rump showed it would only result in what looked like a more and more precise incorrect answer.

Today the same equation shows precision instability because most computers use the IEEE 754 floating point standard. You can still reproduce the result though using a paper published by Eugene Loh and G. William Walster of Sun Microsystems. They showed you can achieve the exact same stable inaccuracy with common floating point sizes (32 bits, 64 bits, and 128 bits) with only a slight reformulation of the original equation.

(333.75 - x2)y6 + x2(11x2y2 - 121y4 - 2) + 5.5y8 + (x ÷ 2y)


If you don't remember that floating point numbers have a precision caveat, they will bite you (even in languages like Python, Matlab, and R). Libraries can help but still require things like manually picking a precision or the use of interval arithmetic. If you've never heard of unums, check them out. Mostly, I hope this has given you a deeper appreciation of the engineering trade-offs inherent in things even as fundamental as floating point numbers.

Appendix 1: Showing My Work

To show my work I could have taken the path of simplifying the equation algebraically. Instead, I've taken to brute force computation by parts to make it as easy as possible for you to follow along. This results in some crazy big numbers, but rest assured, you can plug each of the lines below into a calculator like WolframAlpha and follow along to triple check I'm not cheating. Let's start by spreading the equation out into a series of sums.

= 333.75y6
+ x2(11x2y2 − y6 − 121y4 − 2)
+ 5.5y8
+ (x ÷ 2y)

Separate the whole numbers from their fractional component. Remember from algebra that we can combine the coefficients of like terms when adding. For example 2x + 4x = 6x.

= 333y6
+ 0.75y6
+ x2(11x2y2 − y6 − 121y4 − 2)
+ 5y8
+ 0.5y8
+ (x ÷ 2y)

Let's rewrite the equation to give each of the parts we sum a new variable. This will help us keep track when we rearrange them in a later step.

  = a + b + c + d + e + f
a = 333y6
b = 0.75y6
c = x2(11x2y2 − y6 − 121y4 − 2)
d = 5y8
e = 0.5y8
f = (x ÷ 2y)

Now we compute each of these parts. You can do this in WolframAlpha if you're following along. Just be sure to keep all the precision you can. I truncated f to a few decimals just to keep the math a bit shorter when it's all lined up.

  = a + b + c + d + e + f
a = 437620119945614750330859552768
b = 985630900778411599844278272
c = −7917111779274712207494296632228773890
d = 7197373946062692146455577001386311680
e = 719737394606269214645557700138631168
f ≈ 1.172604

That's basically it. Now rearrange and add them up.

c   −7917111779274712207494296632228773890
d +  7197373946062692146455577001386311680
e +   719737394606269214645557700138631168
a +         437620119945614750330859552768
b +            985630900778411599844278272
f +                                      1.172604