Do Computers Make Errors of Mathematics?

Brute force calculating is right up a computer’s alley, so that kind of function plays to its strength. Computers don’t make “mistakes” as we define the term. Their errors are a product of hardware failure or programming errors.

Sure. But it isn’t possible to design them not to have such errors. Not in the universe we live in. Engineers compromise - which is not a pejorative; it is a fact of life. Bounds are set on what is acceptable. These bounds balance a whole range of factors, including expected errors and use cases. Mitigation strategies exist, and can be applied as needed, but in balance with other constraints. None are 100%.

Entropy will always win. You can use a big stick and beat it down into a cowering mess in a corner, but it will never go away.

Eventually we can design systems that have error rates that are vanishingly low - if we want. What is being understood at the moment is that we probably do want to do better than we have been. This is just shifting the requirements to a new compromise. But perfect isn’t possible.

I was in the Unix compiler group in the 80s and 90s. During the switch to Alpha we transitioned from the Berkeley-based C compiler (cc) to the GEM compilers that came out of the TLE group from VMS. But the old cc was still used to compile the system for a while before we eventually were able to bootstrap. Somewhere in there you certainly used code that I touched, although even then I rarely wrote new code as I was switching to project management.

Once I was warned about potential compiler bugs, so I asked the guy which compiler he would trust, if any. He replied, the one he wrote himself for the Motorola 68030.

I would like to believe that things have changed for the better since then, with thought-out IEEE standards and compilers that are known to conform to them, but the truth is I do not know. Are there still regular stories about computerized systems being bitten by floating-point bugs?

Probably apropos of not much:

You’ve explained my point much more eloquently than I did. I’m pushing back against “the design is fine, so no human error”. A better statement is “the design is fine, as humans intentionally put in that capability of error”. The root cause of the error is the design.

There are cases where the deployed environment simply can’t be guaranteed to be the same everywhere - and some components in the deployed environment, whilst notionally identical in function, turn out to be different. One case I dealt with around the turn of the millennium was to do with drivers for a database file format - the file format was designed in the late 1970s and one of the first few bytes in the file was supposed to contain a number representing the number of years into the current century (i.e. a value of 79 for 1979).
Lots of different software (for example Excel) had its own implementation of the standard, so that they could read and write compliant files…
Except some of them calculated that value as the number of years since 1900, not from the start of the current century, which didn’t matter until the century changed while the format was still in use. Some implementations were (correctly) writing a value of 1 in the year 2001; others were writing 101. Suddenly the files stopped being interoperable across different platforms, or between different programs on the same platform.
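Here’s a toy sketch in Python of how that kind of mismatch happens (the function names and layout are made up for illustration, not the real file format):

```python
from datetime import date

def year_field_as_specified(d: date) -> int:
    # What the format intended: years into the current century (79 for 1979).
    return d.year % 100

def year_field_as_some_wrote_it(d: date) -> int:
    # What some implementations actually computed: years since 1900.
    return d.year - 1900

d = date(2001, 6, 1)
print(year_field_as_specified(d))      # 1
print(year_field_as_some_wrote_it(d))  # 101 - identical answers until 1999, then not
```

Every file written before 2000 looked identical under both rules, which is exactly why nobody caught it.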

I suppose this is a human error, not a computer error, but still something that nobody seems to have thought to test, probably because the people who would have cared weren’t thinking that far ahead, or just failed to imagine the problem.

A computer will probably not be giving you the correct answers if it is on fire either. That isn’t to say that “humans intentionally put in that capability”.

In order to entirely prevent cosmic ray flips, you’d either have to have massive shielding (even putting the computer hundreds of feet underground wouldn’t eliminate 100% of them) or replace it with something mechanical. I doubt that a cosmic ray could flip a “bit” on an abacus.

But humans designed the level of shielding from cosmic rays and designed the systems that supply power to the computer. Even if it’s radioactive decay of the materials in the computer, humans decided which radioactive materials to use and failed to account for how decay could affect the computers’ results. It’s humans all the way down.

I feel like all of the aspects of the OP’s question have been answered but not in a compact and clear way.

First, in general, computers are very consistent. If you ask a computer to add 5 + 4 a billion times, you’ll get 9 all billion times…usually. The only exceptions are things like a solar disruption, the storage medium for the calculation getting corrupted (e.g. because of waterlogging, age, physical damage, etc.), or something of that sort. Similarly, if you have a wheel connected to a crank at a 1:1 ratio, then a full turn of the crank will turn the wheel a full 360 degrees…usually. If the crank breaks off, the gear loses a tooth, etc., then obviously that won’t hold.

Secondly, math has a lot of numbers that are impossible to represent exactly with a finite string of digits. Pi can be shown to a certain number of digits and then you have to give up. 1/3rd - if written as 1/3rd - is something we can show perfectly, but if we convert it to decimal notation like 0.3333333 then, again, we have to choose a place to cut off and end up with an incomplete approximation of the true number.

If you’re doing a bunch of math, you can do things where you try to preserve 1/3rd as a fraction all of the way through until you’ve figured out the full formula, reduced it down to its simplest form with the fewest steps, and then try to calculate out the end result. But you can also do the math as you go, immediately converting to 0.33333333 and multiplying by 5.423 right at step 1 and then continuing on to step 2, 3, 4, and on.
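A quick Python illustration of the difference (the 5.423 is just borrowed from the example above; the exact numbers don’t matter):

```python
from fractions import Fraction

# Keep 1/3rd exact all the way through and only convert at the very end...
exact = Fraction(1, 3) * Fraction("5.423") * 3
print(float(exact))          # 5.423 exactly

# ...versus converting to a chopped-off decimal at step 1 and carrying it along.
step_by_step = 0.3333333 * 5.423 * 3
print(step_by_step)          # about 5.4229994 - already drifting away from 5.423
```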

But eventually, in real world applications, you always need to output a real number. If someone needs to know how long to cut a piece of wood, they need a number like 4.785645 and not a formula like (487 + 1/3)^(4/8 * PI). When that’s true, you’re going to end up using some incomplete, truncated and rounded approximations of PI, 1/3rd, etc. in your calculation. There’s no way to get from a mathematical representation of a set of calculations to an actual decimal number, without converting everything to decimal numbers and performing some operations between them. Depending on how you go about that and how many steps there are, you’ll end up with drift away from the true result.

Ideally, the person setting up the program understands how many calculations there are going to be and how much drift that could introduce, so they adjust how many digits of precision to carry through the math so that the end result is accurate to the level the job needs. If you approximate PI as 3, you’re going to get a wrong result pretty quickly. At 10 digits of PI, for simple calculations, you might be safe up to 5-6 digits in your output. For really complicated stuff where you’re multiplying by PI hundreds of times, iteratively, and you need the final number you’ll use to build your space rocket out to 30 digits, you might need an approximation of PI that’s thousands of digits long, and all of your intermediate values need to be that big too. Your final result will be thousands of digits long, but you’ll know that everything after the 30th place is junk that can’t be trusted, so you crop it off and only show the first 30 digits.
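If you want to see that drift with your own eyes, Python’s decimal module lets you choose how many digits get carried, which is exactly the knob being described (the names and the number of repeats here are arbitrary):

```python
from decimal import Decimal, getcontext

getcontext().prec = 60   # carry 60 significant digits in the arithmetic itself

PI_10 = Decimal("3.141592654")                                 # pi to ~10 digits
PI_40 = Decimal("3.141592653589793238462643383279502884197")   # pi to ~40 digits

def multiply_by_pi_repeatedly(pi_approx, times=300):
    x = Decimal(1)
    for _ in range(times):
        x *= pi_approx
    return x

# The two results agree in only their first seven or so digits; everything after
# that is drift caused by the cruder approximation of pi.
print(multiply_by_pi_repeatedly(PI_10))
print(multiply_by_pi_repeatedly(PI_40))
```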

That’s not really a computer problem, it’s just a component of how math works in the real world.

The thing with computers, though, is that 1) they represent numbers in binary, so numbers that look clean in decimal (like 0.1) end up with no clean cutoff at all, and the 0.33333 problem comes up even more often, and 2) the default ways of dealing with numbers involve 32 bits of precision (roughly 7 significant decimal digits for floating point) or 64 bits (roughly 15-16 digits). So if you’re not being careful, you’re likely to be impacted by one of those limits. A programmer can use libraries that let you define how many bits of precision you’d like to have for everything, but usually we don’t, since the overlap between programmers and mathematicians is fairly small.
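For anyone who wants to poke at those two points, here’s a small Python demonstration (the long test value is just something with enough digits to show the cutoff):

```python
import struct
from decimal import Decimal

# Point 1: 0.1 looks clean in decimal but has no clean cutoff in binary, so the
# 64-bit value actually stored is already an approximation.
print(Decimal(0.1))        # 0.1000000000000000055511151231257827...
print(0.1 + 0.2 == 0.3)    # False, because of those tiny leftover bits

# Point 2: squeeze a number through a 32-bit float and only ~7 digits survive;
# the 64-bit original keeps ~15-16.
x = 1.2345678901234567
x32 = struct.unpack("f", struct.pack("f", x))[0]
print(x)      # 1.2345678901234567
print(x32)    # about 1.23456788 - everything past the 8th digit is gone
```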

This is still, ultimately, human error more than it’s computer error. But a computer can be made to do the wrong thing since it, ultimately, just does what you tell it to do. You can tell it to do something stupid and it’ll do it without complaint. Computers have the reasoning ability of a crank attached to a wheel. If you turn it so fast that the materials can’t stand the strain then…well, it’ll break and that’s all there is to it. It’ll never sprout a mouth and yell at you, “Hey, dummy! Don’t do that! Newton would be sad that you think that’ll work!” It’s just going to let you break it.

We’re talking about the laws of physics here, not bad design decisions. An ideal Turing machine or a computer running under simulation doesn’t have SDCs. Any physical implementation has to be concerned with reliability.
We’ve been worrying about soft errors for years, and I’m not surprised they finally caught up with us. I’m glad I’m retired so I don’t have to worry about them.

So, we’ll put all our data centers deep in salt mines?
I worked with people at a national lab, who had their own fab to build chips going into a high radiation environment with a technology several generations back and design rules that take that into account. I’m sure their stuff wasn’t perfect. (I didn’t have the clearance to learn the details.) I do know the people who did failure analysis on some of their fails.
I guess an abacus is immune to SDCs.
The only people talking about 0 FITS as a goal are clueless managers and salespeople. You can do well, you can beat your goals, but you can never be perfect. And we had the advantage that we stopped tracking when the warranty ran out.

I swear I wrote my comment about the abacus before I saw this!

Thank you. That’s a great summary.

Could you please also talk about Random Number Generation? What I have heard is that computers (or human algorithms implemented in computers) can only generate pseudo-random numbers and not true random numbers. Is that true?

If two computers are asked to generate a large number of random numbers, using the same algorithm, will the sets of numbers have the same statistical properties? For example, will both data sets have the same kurtosis?

True unless you use a trick like using the last few digits of a random temperature from a thermometer or something with lots of precision but not so much accuracy.
Chips sometimes get tested with pseudorandom patterns generated by on-chip linear feedback shift registers. Definitely not truly random. I’m sure there are newer references, but the first (I think) volume of Knuth goes into this in great detail.

Agreed.

A device working as designed may produce an error. The proximate cause of the error may be a cosmic ray or whatever. The root cause is the design, where the device was chosen to be vulnerable in that particular way. And probably for a good reason, but the device still produced an erroneous result and it’s due to the choices humans made.

Only in the sense that humans choose not to go faster than light.
Devices age. You can mitigate this by burning in the devices and moving them up the bathtub curve, but the bathtub curve rules.
Now sometimes reliability failures come from human error, but not always.

Yes and No.

The traditional and original way of creating a “random” number was to take the current time and run some math on it that produced a number that seemed random. When you asked for a new random number, it would take the previous number and do the same math on it that it had originally done on the current time.

You could also put in your own starting (seed) number and start the process. This allowed you to exactly recreate the circumstances of your program (like to get the same starting configuration of Solitaire that you’ve been trying to solve, in Windows).

There’s a mathematical relationship between all of the numbers that are spit out of this process and, if you know what time the program was run, you might be able to figure out what the whole sequence was. (This has some software security implications.)
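A minimal sketch of that idea - this is a “linear congruential generator”, one classic version of the do-the-same-math-on-the-previous-number trick. The constants are the widely published Numerical Recipes ones, not any particular library’s, and the class name is mine:

```python
import time

class TinyLCG:
    """Toy pseudo-random generator: next = (a * previous + c) mod 2**32."""

    def __init__(self, seed=None):
        # Default to seeding from the clock, like the early libraries did.
        self.state = int(time.time()) if seed is None else seed

    def next(self):
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state

a = TinyLCG(seed=12345)
b = TinyLCG(seed=12345)
print([a.next() for _ in range(3)])   # same seed...
print([b.next() for _ in range(3)])   # ...exactly the same "random" numbers
```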

Since the early days, however, we’ve added two strategies to the bag:

  1. Specialized hardware. If you add a device to the computer that can do something like measure the ambient temperature out to 100 decimal places, then the smallest digits of that reading are effectively random. Someone might reasonably guess that your computer was sitting in a room around 75 degrees, but there’s really no way to predict what the temperature is out to 100 digits, and it’s always fluctuating, with the smaller digits fluctuating more randomly the further out you go. Mathematically, you can even that out and run some formulas that create a 64-bit number where every bit is equally likely to be any particular value, with no way to predict how it will go from reading to reading.

  2. Old computers used to operate where you’d put a program onto a disk, put the disk into the computer, turn it on, and it would run the program on the disk. When it was running, that one program was basically all it was doing. But for a long time now, we’ve had an OS that runs when the computer turns on, and not only can it run programs without having to turn everything off to make room, it can even have multiple programs running at the same time. Once we added internet connections and background processes like file system optimizers, automatic time synchronization, etc., we’ve basically got a whole load of stuff happening on the computer, all the time, and all unconnected to one another. Something’s always happening, from random packets of information on the network that we have to scan to see whether they’re intended for us, to the user moving their mouse around, to scheduled virus scans starting.

The people who write the OS (Windows, Mac, Linux, etc.) can set up a process that simply watches all the things going on around the computer and keeps track of silly metrics like what direction the mouse last moved, how many processes are running, how many packets of information were looked at in the last hour, what temperature the CPU is, etc. Some of these aren’t terribly random on their own but, once you mix them all together, you get something that’s effectively random without having to add special hardware to the system.

OSs now provide functions that return a true random number.

I’m not sure but I believe that the OS functions perform a mix of sampling the hardware values and running pseudo-random formulas on them, in order to increase throughput. If so, they should document whether the formulas that they use are secure enough to use in cryptographic applications.
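In Python, for example, that OS-level source is what you get from os.urandom and the secrets module. Exactly how the kernel gathers and mixes its entropy is up to the OS, so treat the comments here as a rough description rather than a spec:

```python
import os
import secrets

# Bytes pulled from the operating system's entropy pool (hardware events,
# timing jitter, etc.), run through the kernel's cryptographic generator.
print(os.urandom(16).hex())

# The secrets module is a wrapper around the same source, and is the one
# recommended for security-sensitive things like tokens and keys.
print(secrets.token_hex(16))
print(secrets.randbelow(100))   # an unpredictable integer in the range [0, 100)
```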

It’s going to depend on a variety of things. You can still do it the old fashioned way but there are new, better ways. You can use a starting seed or use hardware randomization. You can use old, horrible formulas and new, cryptographically secure ones.

It depends on what you need. In general, the more perfect you want it, the longer the amount of time needed to create a new number.
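On the earlier question about two computers using the same algorithm: with the same algorithm and the same seed they produce the identical sequence, so every statistic (kurtosis included) matches exactly; with different seeds the numbers differ, but the distributions are designed to be statistically indistinguishable. A quick sketch using Python’s built-in generator (the mean is standing in for fancier statistics like kurtosis, which the standard library doesn’t compute):

```python
import random
import statistics

def draw(seed, n=100_000):
    rng = random.Random(seed)        # Mersenne Twister - identical on any machine
    return [rng.random() for _ in range(n)]

a, b = draw(seed=42), draw(seed=42)
print(a[:3] == b[:3])                # True: same seed, same sequence, same stats

c = draw(seed=7)
# Different seed: different numbers, but both means sit very close to 0.5,
# as they should for uniform samples on [0, 1).
print(statistics.mean(a), statistics.mean(c))
```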

Humans choose the design of a device. Humans choose the conditions the device is operating under. Humans choose the error rates of the device under those conditions. The errors are a result of human choices.

This is not because of poor design of the device, but because compromises must necessarily be made.

Yes, uses electrons in wires rather than beads on a stick.

Yes, this universe.

Most computers should expect to see a random bit flip a few times a month; it’s hard to track down a specific number. Humans have built a fair amount of error correction into these systems, so most of those flips are corrected.

To eliminate 100% would definitely not be practical, and probably not even possible, no matter how much you wanted to.
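The error correction mentioned above is normally done with ECC codes in the memory hardware itself, but the flavor of the idea can be shown with a toy majority-vote scheme (this is just an illustration, not how real ECC works):

```python
from collections import Counter

def majority(copies):
    """Keep several copies of a value and trust whatever most of them say."""
    return Counter(copies).most_common(1)[0][0]

stored = [0b1011, 0b1011, 0b1011]    # three redundant copies of the same bits
stored[1] ^= 0b0100                  # a stray "cosmic ray" flips one bit in one copy
print(bin(majority(stored)))         # 0b1011 - the corrupted copy gets outvoted
```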

Right, the choice to have electronic computers rather than abacuses.

It looks to me like what you’re saying is true in a sense, but it is not useful. It’s sort of the equivalent of a non-falsifiable statement. What would it mean, in your mind, for an error not to be caused by humans? What would such an error look like?

You seem to just be saying that, for anything made by humans, any error or problem involving that thing must, by definition, be human error.