Approximately what threshold is good for comparing floats/doubles?

Any programmer worth his salt knows that the “==” operator is very dangerous when comparing floating point numbers (except in languages where this is abstracted away for you, though I can’t think of any that do off the top of my head). Rather, what you should do is



if abs(num-expectedValue) < ϵ {
  //Code
}


Where abs is the absolute value function and ϵ is some tiny value. But the question I have is: what exactly is a good value for ϵ? I generally use 1e-7 (1 × 10⁻⁷), which in my experience has never given me a false positive or negative, but I don’t know if there’s a generally accepted value for this type of comparison.

Assume that I’m testing the values of two completely arbitrary numbers – I don’t have the luxury of tuning the comparison test to any expected range. Should I use the smallest possible positive non-zero float? A somewhat larger number?
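To make the trouble concrete, here’s a small sketch in Go (the language that comes up later in the thread) of how a fixed ϵ like 1e-7 behaves at different magnitudes; the specific test values are just illustrations:

```go
package main

import (
	"fmt"
	"math"
)

const epsilon = 1e-7

func approxEqual(a, b float64) bool {
	return math.Abs(a-b) < epsilon
}

func main() {
	// Fine for values near 1:
	fmt.Println(approxEqual(0.1+0.2, 0.3)) // true
	// But a fixed epsilon misbehaves at other magnitudes:
	fmt.Println(approxEqual(10000000001.0, 10000000000.0)) // false, though they differ by 1 part in 1e10
	fmt.Println(approxEqual(1e-10, 2e-10))                 // true, though b is twice a
}
```

In other words, an absolute epsilon is implicitly tuned to values near 1, which is exactly the tuning the question says isn’t available.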

It entirely depends on where your floating point numbers are coming from. What kind of calculation is producing these numbers, and what kinds of inaccuracies can be expected in that calculation?

I’m writing a 3D math library*, which means I’m bound by essentially any set of operations that could be performed on a vector, matrix, or quaternion. I suppose the likely values are those typical of 3D graphics (which, at a small enough clock tick between updates, can get fairly small, but are certainly not guaranteed to be so).

  • I’m not reinventing the wheel, I’m writing a package for an immature language that doesn’t have a mature 3D math library yet. In case you’re wondering why the hell I’d do this when, say, GLM exists.

Huh, I learned something new today; apparently this is the hip new and correct way to compare floating-point numbers:

It seems… slow, though, with 3 comparisons in the typical case. I mean, not that slow, just more work than it seems like should be done for such an operation.
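For reference, the “three comparisons in the typical case” shape being described is presumably a combined absolute/relative check along these lines (my reconstruction, since the link isn’t quoted here; the linked article’s exact formulation may differ):

```go
package main

import (
	"fmt"
	"math"
)

// nearlyEqual does an absolute check for values near zero, then falls back
// to a relative one; in the typical case that's about three comparisons.
func nearlyEqual(a, b, absTol, relTol float64) bool {
	diff := math.Abs(a - b)
	if diff <= absTol { // catches values near zero, where relative error blows up
		return true
	}
	larger := math.Max(math.Abs(a), math.Abs(b))
	return diff <= larger*relTol
}

func main() {
	fmt.Println(nearlyEqual(1e10+1, 1e10, 1e-9, 1e-9)) // true: relatively very close
	fmt.Println(nearlyEqual(1e-12, -1e-12, 1e-9, 1e-9)) // true: both near zero
	fmt.Println(nearlyEqual(1.0, 1.1, 1e-9, 1e-9))      // false
}
```

The extra comparisons buy scale-independence, which is the whole point given the OP’s “arbitrary numbers” constraint.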

I’ve used this approach before, and it turned out (AFAIK) well. I hammered out a tiny routine to compare two values with a variable margin. I wasn’t working with massive amounts of data, so the overhead wasn’t noticeable.

Some people decide to get rid of the floats entirely by multiplying them by a known constant. Obviously this isn’t a great approach for all scenarios.

The problem is not comparing the two floats, but the amount of accumulated error that may appear in either one. For specific functions you can choose a value for epsilon that accounts for the worst case in that circumstance. But when writing libraries, the same function may get used in different contexts where you don’t know how much inaccuracy has already accumulated in the input parameters.

As an example, when rendering geometry your range of equivalency for a given point is somewhere around the size of one pixel, if you even know it at that point in the algorithm. However, the same function may be determining the location of a star light-years away, which is still much smaller than the pixel size. Failure to account for that level of precision can result in a background of blinking stars whenever the perspective changes. I’ve seen an entire rendering package recast to doubles to account for that.

One possibility for you, in some cases, is to pass epsilon as an optional parameter, so the calling function can account for the possible accumulation of error. The problem is bad enough in geometry; coloring effects take it much harder, easily resulting in rastering in the final image or ruined display of reflective surfaces.
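One hypothetical way to express that optional-epsilon idea in Go, which has no optional parameters (the variadic form and the 1e-7 default are my assumptions, not anything from the thread):

```go
package main

import (
	"fmt"
	"math"
)

// approxEqual lets the caller supply its own accumulated-error estimate;
// the 1e-7 default is arbitrary, per the rest of the thread.
func approxEqual(a, b float64, eps ...float64) bool {
	tol := 1e-7
	if len(eps) > 0 {
		tol = eps[0]
	}
	return math.Abs(a-b) < tol
}

func main() {
	fmt.Println(approxEqual(1.0, 1.00000001))       // true with the default
	fmt.Println(approxEqual(1.0, 1.00000001, 1e-9)) // false with a tighter tolerance
}
```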

As I’ve noted elsewhere, I’ve often found it frustrating that so few programmers seem to have any idea about these matters.

As some have noted above, it’s hard to choose the right epsilon when, as stated right in the OP, you can’t make any assumptions about the ranges of the numbers you’ll be looking at.

For some kinds of apps, you would know what size numbers you have. I worked with a cash register app for many years that was CHOCK FULL of comparison error bugs like this because the programmers had no clue about these things. The data consisted mostly of dollars-and-cents amounts, encoded as floating-point dollars (e.g., 7 dollars and 43 cents was the float 7.43).

A common type of comparison was: Did the customer purchase 10.00 or more, to qualify for a discount? Well, a customer might well have bought exactly 10.00 of stuff but it actually added up to 9.9999999999983, so the register wouldn’t give the discount. But it sure looked like 10.00 on the receipt, since amounts were rounded to two decimal places. ETA: A similar type of comparison, with similar bugs, was common for determining sales tax thresholds, and many other kinds of threshold cases.

The programmers got into the habit of making such comparisons by converting to a character string with two decimal places (which the conversion library automatically rounded), and comparing the character strings. In my code, however, I used the abs(a-b) < epsilon trick, for which I always used epsilon = 0.0000001

Chronos, on the other hand, pointed out that in some applications, like in his work, if your two numbers are so close together as to raise this problem, then maybe it doesn’t matter which branch the program takes. (See posts 4, 6, 13, 14, 16 for the complete conversation on that point in that thread.)

It’s said that if you can’t solve your problem using only scaled integers, you don’t understand your problem well enough.

This sounds crazy, but in some cases you’re better off doing an integer comparison of your floats. This guarantees that your epsilon scales with the size of the numbers.

For instance:
```c
// assumes input numbers are positive! needs more special casing.
#include <stdbool.h>

#define EPSILON 5

bool fcmp(float a, float b)
{
    int ia = *(int *)&a;
    int ib = *(int *)&b;
    int diff = ia - ib;
    return (diff >= -EPSILON) && (diff <= EPSILON);
}
```

Just as an example, suppose you compare 1.0 with 0.9999999404 (the largest float less than 1.0). Their internal representations look like this:
1.0: 0x3f800000
0.9999999404: 0x3f7fffff

(0x3f7fffff - 0x3f800000) = -1. So they’re almost equal.

Basically this works because (ignoring signs) floats are in the same magnitude order as their integer representations. Add 1 to the integer representation of a float and you get the smallest number that’s strictly larger.

Heh. I’ve said things like that myself. But floats really are just scaled integers and the same problems are inherent. It’s pretty easy to account for the first time you can’t split a bit in half just using integers, but as the algorithm gets more complex the error can build up. Floats just make it a little more difficult because the precision break isn’t clearly defined and the effects of scale are hidden from the programmer.

Unfortunately, in the language I’m working in, getting an integer representation of a float requires Dark Magicks.

Okay, that’s not entirely true, the math package includes a conversion from float32 (or float64) to a uint32 (or uint64), but I’m not sure what its overhead is (it may be more than a function call and a pointer dereference). I may add a function that uses it and benchmark it against the one I linked upthread.

ETA: And unfortunately, I don’t think scaled integers are necessarily an option for a 3D math library meant to interface with OpenGL, though I have used scaled integers before in other contexts to good effect. Though I do allow vectors to be of integer type, so if a programmer wants to try they sure can.

Yeah, it’s probably not worth it if the conversion is too expensive. Hard to say. I don’t suppose you could tell us the language? One of us might know of a trick that you don’t. For instance, JavaScript has fairly opaque data types but I know of a method that would be quite fast.

Go. It has functions in the math package called Float[32|64]bits (with Float[32|64]frombits going the other direction). The only other way I can think of to do it is to import the unsafe package, which lets you break language conventions and pretend you’re C for a minute. But I think it can also confuse the garbage collector if used incorrectly (hence “unsafe”).

Though on review, all it does is basically “cloning the C code with unsafe” like I mentioned, and since it’s in the standard library I assume it’s safe:


return *(*uint64)(unsafe.Pointer(&f))

(I forgot that Go was open source for a minute, and went and just looked up the package code for the math package)

This seems like logic that could be done in a three-input circuit. That would fix the speed problems.

The context in which I first heard (and usually hear) the aphorism was in a physics context where I’ve seen people actually use fixed-point numbers to avoid floating point issues (with understanding of the problem required to know where to put the fixed point).

The main problems with floating point numbers aren’t the little rounding errors that show up in calculations involving numbers of about the same order of magnitude. The big losses of precision show up when you mix quantities at widely different scales (like 1e+20 and 1e-20). When the representation can uniformly handle values multiple factors of googol apart, you can get problems easily that the discipline of using fixed-point forces you to think about and avoid.
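Once the scales are far enough apart, the loss is total, as a one-liner shows:

```go
package main

import "fmt"

func main() {
	big, small := 1e20, 1e-20
	fmt.Println(big+small == big) // true: small vanishes without a trace
	fmt.Println((big + small) - big) // 0
}
```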

I recall a product in the days of Intel 386 computers - turn your 386 into a 486 with this simple upgrade software!

Reading between the lines, it was a replacement for the floating point library for Windows. Most versions of the 486 had the math coprocessor component integrated on the chip and flew through floating point calculations, while most 386s did not have the separate math chip installed and used software to do the calculations.

This replacement software would cut the accuracy of floating point to about 3 or 4 digits instead of 8 or 9, thus cutting calculation time dramatically. Not good for your scientific calculations, but most people wanted faster games. 1024×768 was a high resolution screen in those days, so 4 digits of accuracy was more than enough to calculate positions, trajectories, and other floating point values for games.

So I guess for the OP, the question is: how accurate does your answer need to be? To show the nuclear missile arcing across the screen, 1 in 10,000 is fine. To figure out the thickness of a wing spar, I suggest 8 or 9 digits of accuracy.

Then there’s the whole field of calculating expanding rounding and error values as calculations progress.

Relevant and highly recommended: What Every Computer Scientist Should Know About Floating-Point Arithmetic.

It’s long and exhausting but an informative read. The moral of the story is, floats are hard. Probably harder than you think. Per leahcim’s advice, if you can solve your problem with fixed-point arithmetic, you should seriously consider it. :smiley:

Using ULPs (units in the last place) is definitely one of the better ways to compare floats, in particular because if you use an absolute epsilon, you run the risk that your “range” of comparison will only contain one representable number anyway! Just be careful that you turn off strict aliasing optimizations in your compiler if you do this, because technically the result of this code is undefined by the C or C++ standards.

I think that you can get around the aliasing issue by using a union of a float and int instead of a pointer cast, though I’m not certain of that. I’ve never actually seen it be a problem in practice, though I know it’s a theoretical issue.

I recall that disproportionate operands were the issue in the blinking-stars case I mentioned above. Recasting to doubles was the expedient solution, but in the end a process of rescaling values was also used to avoid the problems you mention. IIRC, the biggest problem areas in that case weren’t at the most extreme differences in magnitude (there, the least significant digits weren’t affecting the results anyway), but somewhere in between, where the hardware and some library functions began producing widely varying results because the algorithm tried to maintain some of those insignificant digits in the computed result. Your comment is interesting because that system was designed by physicists, who occasionally complained that their slide-rule numbers were better than the computed ones; they were pointing out that knowledge of the scales was important in calculating the results.

Aliasing isn’t implicated, but reading from a member of a union other than the last member written to is also undefined behavior. You can avoid violating strict aliasing by just aliasing to a char* instead of int*, since the char type is the one exception to the strict aliasing rule. In practice, of course, type punning through a union works just like you’d expect it to; I use it myself fairly often. :stuck_out_tongue: