Do Computers Make Errors of Mathematics?

Well, yes. The whole point of engineering is making compromises to manage what works and what doesn’t. When a device has an error, it’s not often acceptable for us engineers to shrug our shoulders and say “the universe did it.” We made the device; we’re responsible for its correct operation. We design for specific error rates, and the customers certainly document them. A lot of effort is spent understanding errors and the chains of events that produce them. And the root cause will always come down to an engineering choice: the device was designed that way.

Out of curiosity, if an asteroid hits a city and destroys a building in it, would you say that the destruction of the building was due to humans making engineering choices?

Another example, perhaps more relevant to the OP’s question: I worked with a system that was built on a rather old database/programming solution (Unidata - basically the loser in the contest for database solutions that was won by SQL). Sometimes it would just fail to do a thing. It was comparatively rare in terms of total operations performed, but it still happened enough that there would be multiple failures daily. Sometimes it would fail to write a file to disk; sometimes it would fail to populate a database field; sometimes it would fail to properly sum all of the values in an array (i.e. missing one of them at random); sometimes it would fail to execute a line of code in the middle of a program and just behave as though it had skipped over it.

The company supplying us with a system that used this framework simply worked around it by adding duplicate operations (i.e. write the value 4 times to the database field - it can’t fail every time!), but the system was huge, so they had only done this in a few mission-critical places.
The company supporting the database software blamed the hardware and/or OS, but the problem persisted in exactly the same form across several migrations in which everything changed. The phrase ‘sometimes Unidata doesn’t do a thing’ became a catchphrase for my team.
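
For the curious, that workaround amounts to a write-and-verify-with-retry pattern, roughly like the Python sketch below. The `write_field`/`read_field` callables are hypothetical stand-ins for whatever the real database layer provided; this is only meant to illustrate the idea, not the vendor’s actual code.

```python
import time

MAX_ATTEMPTS = 4  # the vendor's "write it 4 times" idea, but only retrying on failure


def write_with_verify(record, field, value, write_field, read_field):
    """Write a value, read it back, and retry if the write silently didn't happen.

    write_field/read_field are hypothetical callables standing in for the real
    database layer; this only illustrates the write-then-verify-then-retry pattern.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        write_field(record, field, value)
        if read_field(record, field) == value:
            return True             # the write actually took effect
        time.sleep(0.05 * attempt)  # brief backoff before trying again
    return False                    # give up and let the caller log or alert
```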

Our best guess was that there was some overflow condition that caused it to drop an operation under some confluence of circumstances, but we never did manage to pin it down. It wasn’t load-related; it could happen when there was only one user on the system.

Undoubtedly a bug, but nobody could fix it, or make any sense of it. Sometimes, Unidata just doesn’t do a thing.

Of course. No different than not making a building proof against earthquakes, floods, wind, termites, etc. That was a choice. Probably a good choice when evaluating the costs vs the risks, and maybe an implicit choice if it wasn’t even considered.

I’d go with: It was made by really bad programmers and the way that it does everything is so poorly designed that it’s just barely able to function.

But, at the end of the day, it is just doing exactly and precisely what it’s been coded to do.

No, I’m just pointing out, to use your analogy, that if putting computers in salt mines could reduce those errors, and we don’t put all the computers in salt mines, human decisions chose to accept errors rather than be limited to computers in salt mines. Obviously, we have decided we can live with some trivial level of avoidable errors if it means we aren’t limited to the number of computers we could fit in salt mines.

If you are saying we can make devices more reliable, I have no argument. That’s what the people at the national lab were doing. If you are saying you can make a device 100% reliable, that’s just not right. You can design an engine to be more or less efficient, but that it is not 100% efficient is not the fault of people.

I worked on parts that went into servers with four nines reliability. I reported on the FIT rate of our processor to our exec VP - fortunately we were always better than our goal. We built in redundancy. We built in ECC. We did very good testing at multiple levels of hierarchy. But no matter how many voters and how much redundancy you put in, there might be cases where it is not enough. That’s not a design decision, that’s reality. Entropy biting you is not a bad design decision.
Now one day my VP asked me to analyze the system impact of making our processor less reliable. We were much more reliable than our spec, and that meant we were leaving cost and performance on the table. We never backed off, but if we did and things failed, those failures would be design decisions. But that’s different from not achieving perfection.
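
For anyone who wants to see what those numbers look like, here is some back-of-the-envelope arithmetic in Python. The FIT value, part count, and per-copy failure probability are made-up illustrations, not the real figures from that project. FIT is failures per billion device-hours, and the voter math assumes independent failures and a perfect voter, which real voters aren’t.

```python
# Back-of-the-envelope reliability arithmetic. All numbers are illustrative
# assumptions, not the actual figures from the project described above.

FIT_PER_PART = 100          # assumed failures per 10^9 device-hours, per part
PARTS = 500                 # assumed number of such parts in the server
HOURS_PER_YEAR = 8766

# Expected failures per year for the whole box (simple sum of rates):
failures_per_year = FIT_PER_PART * PARTS * HOURS_PER_YEAR / 1e9
print(f"expected failures/year: {failures_per_year:.3f}")

# Triple modular redundancy with a majority voter: the system is wrong only
# if 2 or 3 copies fail together. If each copy fails with probability p
# (independently, and assuming a perfect voter, which real voters aren't):
p = 1e-4
p_tmr = 3 * p**2 * (1 - p) + p**3
print(f"single copy: {p:.1e}, TMR: {p_tmr:.1e}")  # much smaller, but never zero
```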

I think rather than barely able to function, it was genuinely 99.9% functional or more, it’s just that the missing fraction of a percentage was composed of random silent failure to do a thing, for no discernible or repeatable reason.

There is an interesting point behind all of this - that it’s impossible to make an infallible thing out of fallible pieces - even if you design-in checks and balances, you have to use fallible pieces for that too; all you can ever do is chase the probability of failure down to smaller and smaller values, but never zero.
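
You can put rough numbers on that. Each layer of checking catches most, but not all, of what slips past the layer below it, so the escape rate keeps shrinking without ever reaching zero. The rates in this little sketch are purely illustrative:

```python
# Layer checks on top of fallible pieces: each layer catches most, not all,
# of what slips past the previous one. The escape rate keeps shrinking, but
# a product of nonzero factors is never zero. Numbers are illustrative.
base_failure_rate = 1e-3   # assumed raw failure rate of the underlying piece
check_miss_rate = 1e-2     # assumed fraction of failures each check misses

escape = base_failure_rate
for layer in range(1, 5):
    escape *= check_miss_rate
    print(f"after {layer} layer(s) of checking: {escape:.1e}")
```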

I would say that issues that result from these types of events are not computer errors. They are a property of the universe we live in.

The Unidata issue could easily be vastly more nuanced than your run of the mill bug. It has the hallmarks of a timing related issue, but is likely quite subtle.

I’ll give an example of my favourite bug. Years ago I was writing an object-based database system (as part of my PhD). I found an extraordinarily weird problem. Sometimes, a slab of data didn’t copy correctly. A single byte right in the middle would be missing. Not overwritten, actually missing. The string “abcdefghijk” would copy as “abcdfghijk”. The underlying code doing the work was the standard C runtime bcopy routine. Bcopy itself was implemented with a single instruction, MOVC3. (Well, the 68k version of that instruction.) Somehow a single machine instruction designed to copy data about was failing. It seemed incredible that there could be such a bug in the processor. There wasn’t. But there was a bug. I discovered that other users around the world had also seen it. The Internet was young, so there wasn’t much to go on. Anyway, I found and fixed it.
These things can be really weird and not what you think, and just plain evil.
Computer systems are not simple once you get under the neat abstractions the OS and runtimes provide. Blaming weirdness on bad programmers is not done by people who have been there.

I’m not saying any of those things–it’s not at all what I’m talking about.

I’m talking about error analysis–finding exactly how a device produced an incorrect result. Going back to an earlier example: a building collapsed. This is an “incorrect result” for a building. Analysis determines that the proximate cause is an asteroid smashing into it. The root cause is that the building was not designed to withstand asteroid impacts. (A real analysis will have branching chains of causation.)

There is nothing pejorative about saying the root cause of the error is the design. To repeat myself: this is not because of poor design of the device, but because compromises must necessarily be made.

But this only has any relevance at all if there is a design that can withstand an asteroid impact. If there is not, then it’s not about compromise, it’s about living in the real world.

Relevant to computers: you can increase their robustness against outside influences like cosmic rays, but you can’t make them 100% immune.

If you put your computer in a salt mine, surround it with a faraday cage and lead and whatever else would help, make the features of the chip much larger, then you can severely reduce the effect of cosmic rays, but you still couldn’t eliminate it entirely.

You can say that accepting 1 flipped bit a month in a desktop on the surface is a compromise, and that’s useful, as what’s acceptable for doing a Mario speedrun is not necessarily going to be acceptable on a critical piece of hardware keeping people alive in deep space. That piece of hardware needs to be designed to different specifications to have an acceptably low error rate in its environment, but it could never be 0%.
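
To put a rough number on the “one flipped bit a month” sort of figure, here is a quick calculation. The upset rate below is an assumed, illustrative value chosen to land near that ballpark; real rates vary enormously with altitude, process, and whether ECC is present.

```python
# Rough arithmetic for a "flipped bit a month" sort of figure. The upset rate
# is an assumed, illustrative value, not a measured one.
UPSETS_PER_MBIT_PER_BILLION_HOURS = 10   # assumed FIT per Mbit at sea level
RAM_GB = 16
HOURS_PER_MONTH = 730

mbits = RAM_GB * 8 * 1024                # 16 GB expressed in megabits
flips_per_month = mbits * UPSETS_PER_MBIT_PER_BILLION_HOURS * HOURS_PER_MONTH / 1e9
print(f"expected bit flips per month: {flips_per_month:.2f}")  # ~1 with these assumptions
```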

I’ve been a programmer for 20 years, working in everything from drivers to games to big data. There are a lot of small companies trying to make equivalents to the software made by big companies - database solutions, cloud solutions, OSes, etc. - but who can’t offer the benefits of the big companies, so they end up taking on developers who can barely implement FizzBuzz and are now being asked to work with multithreaded logic, interrupts, locks, and everything else that comes with it.

If the program was always bad then I’d bet on bad programmers.

On the reverse end of things, you have long-existing programs like MSN Messenger that started out well-made and then - as the core group left and was replaced by interns, and people who didn’t know or understand the code came in and started adding feature upon feature - the thing started to fall apart and get flaky. But it wasn’t broken from the beginning, and the problem is less about the quality of the people coming in than about the schedule and management of the project.

If you want to say that the underlying system - the OS or the CPU - was broken, then that was still a bad programmer. But it would only happen on that specific system; once you migrate to a new system, it should go away. Or, if it was in a library, then the developers should have rolled their own. I had to deal with an issue in PHP’s unicode handling when generating POP email requests that included foreign characters. I ended up dropping their API and writing my own networking code to do it.

It’s nice not to have to roll your own, but you should be able to recognize when it’s necessary, and you should be able to do it if the issue isn’t in a place that you can fix.

At 99.9%, maybe it’s reasonable to blame edge cases. Your original description made it sound like the thing was generally flaky and that it was pretty easy to get a failure in any particular run.

You could have both.

If it is 99.9% reliable, but it runs a few thousand iterations each run, then it would still be likely to have a failure or two in each run.
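
Quick sanity check on that arithmetic, with an assumed figure of a couple of thousand operations per run:

```python
# At 99.9% per-operation reliability, a run of a few thousand operations is
# more likely than not to hit at least one failure. ops_per_run is assumed.
ops_per_run = 2000
p_ok = 0.999
p_at_least_one_failure = 1 - p_ok ** ops_per_run
expected_failures = ops_per_run * (1 - p_ok)
print(f"P(at least one failure per run): {p_at_least_one_failure:.2f}")  # ~0.86
print(f"expected failures per run:       {expected_failures:.1f}")       # ~2.0
```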

But again, if it’s every run then that should be apparent during initial testing or at least have gotten back to the developer almost immediately. It shouldn’t stay an issue for more than a couple of weeks - maybe a couple of months at max - let alone for years on end.

Fair, and intermittent failures are the worst to diagnose.

Good point.

What’s more remarkable, to me, is the exact opposite: that it is possible to make arbitrarily reliable things out of arbitrarily unreliable things. If you have a radio link, for instance, there is no amount of noise which will prevent you from transmitting bits with as much confidence as you like. The only effect is to reduce the bitrate. For other cases, such as memories, the effect is to reduce the effective capacity, but what is left can be as reliable as you want.
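
Here is a minimal sketch of that idea, using a simple repetition code over a noisy channel. Real systems use far better codes than repetition, and the flip probability and trial count are just illustrative; the point is only that adding redundancy trades bitrate for a residual error rate that shrinks but never reaches zero.

```python
import random

def send_bit(bit, flip_prob, reps, rng):
    """Send one bit `reps` times through a channel that flips each copy
    with probability `flip_prob`, then take a majority vote."""
    received = [bit ^ (rng.random() < flip_prob) for _ in range(reps)]
    return int(sum(received) > reps / 2)

rng = random.Random(0)
flip_prob = 0.1          # assumed raw channel error rate
trials = 200_000
for reps in (1, 3, 7, 15):
    errors = sum(send_bit(1, flip_prob, reps, rng) != 1 for _ in range(trials))
    # More copies: lower effective bitrate, lower (but still nonzero) error rate.
    print(f"{reps:2d} copies: rate = 1/{reps}, observed error rate ~ {errors / trials:.5f}")
```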