Is it theoretically impossible to engineer a 100% reliable solution out of unreliable parts?

You have a general-purpose set of components - for example, a programming language - but it has a limitation: it’s only 90% reliable. A random 10% of all operations fail to work.

Using only this fallible toolkit, can a fully reliable system be engineered?

I think not - because even if we engineer a layer of self-checking to try to catch the failures, that self-checking will only be 90% reliable - so we will only have achieved 99%. A third layer takes us to 99.9%, and so on. We can only approach full reliability.
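For what it’s worth, the arithmetic of stacking checks is easy to sketch in code. A minimal Python illustration, assuming (idealistically) that each layer fails independently with probability 0.1:

```python
# Residual failure probability after stacking n independent checking layers,
# each of which misses a failure 10% of the time.
def residual_failure(p_fail: float, layers: int) -> float:
    # An undetected failure requires every layer to fail at once.
    return p_fail ** layers

for n in range(1, 6):
    print(f"{n} layer(s): reliability = {1 - residual_failure(0.1, n):.6f}")
# 1 layer(s): reliability = 0.900000
# 2 layer(s): reliability = 0.990000
# 3 layer(s): reliability = 0.999000
# ...approaches 1.0 but never reaches it for any finite number of layers.
```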

Actually, maybe we can’t even do that, because the fallibility of the self-checking is going to introduce false negatives too.

Or am I missing some other logical technique? I get that we could probably approach reliability to the extent that failure is less likely than the end of the universe, but can we ever reach theoretical 100% reliability, using not-fully-reliable parts?

You can’t, no matter how many error checks and compensating factors you string together. If none of them is completely reliable, it is possible for all of them to fail at exactly the same time in a way that causes the entire device or machine as a whole to fail. This concept applies to the real world, and it explains why no device can ever be made to be strictly 100% reliable. There is always some chance of failure, no matter how small. I can’t say I understand it well, but there are quantum effects that could theoretically just make the whole device disappear if the subatomic particles that make it up hit exactly the right set of probabilities.

With unreliable parts, you have a lot more holes lining up in the Swiss cheese model.

In computer programming, there is a set of algorithms that are self-stabilizing.

Basically, if the system is in an incorrect state, it should eventually converge to a correct state, provided there are no further failures for long enough.
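As a concrete illustration, here is a rough Python sketch of Dijkstra’s K-state token ring, probably the best-known self-stabilizing algorithm: start the ring in any arbitrary state and, absent further faults, it converges to a legitimate configuration with exactly one privileged machine. (The random scheduler and the parameters here are my own choices for the demo, not part of anything discussed above.)

```python
import random

def privileged(states, i):
    """Machine i holds the 'privilege' (token) under Dijkstra's K-state rules."""
    if i == 0:
        return states[0] == states[-1]     # bottom machine
    return states[i] != states[i - 1]      # all other machines

def step(states, K):
    """Let one arbitrarily chosen privileged machine make its move."""
    candidates = [i for i in range(len(states)) if privileged(states, i)]
    i = random.choice(candidates)          # at least one machine is always privileged
    if i == 0:
        states[i] = (states[i] + 1) % K    # bottom machine increments modulo K
    else:
        states[i] = states[i - 1]          # others copy their left neighbour's state
    return states

n, K = 5, 7                                # K > n is enough to guarantee convergence
states = [random.randrange(K) for _ in range(n)]      # start in an arbitrary corrupt state
for _ in range(200):                       # run with no further faults
    states = step(states, K)
print(sum(privileged(states, i) for i in range(n)))   # prints 1: exactly one token left
```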

Which converges to perfect reliability only over infinite time. This is one of the common problems with such reasoning: you can move the problem around, from one domain to another, but it never actually goes away.

I’m not understanding how you can have a 90% functional “programming language”. It’s like saying you have a 90% functional math equation. I can understand programs failing due to coding errors, but how can the language itself be “unreliable”?

There’s also the concept of Byzantine fault tolerance.

In general, 100% reliability is impossible, but 99.999999…% reliability for any number of 9s is achievable.
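To put rough numbers on the “as many nines as you like” claim, here is a quick sketch, assuming you can run independent replicas of a 90%-reliable unit and take a majority vote (the independence assumption is doing all the work here):

```python
from math import comb

def majority_failure(p_fail: float, replicas: int) -> float:
    """Probability that a majority of independent replicas fail simultaneously."""
    need = replicas // 2 + 1
    return sum(comb(replicas, k) * p_fail**k * (1 - p_fail)**(replicas - k)
               for k in range(need, replicas + 1))

for r in (1, 3, 5, 9, 15):
    print(f"{r:2d} replicas: failure probability = {majority_failure(0.1, r):.2e}")
# The failure probability shrinks rapidly as replicas are added,
# but it never actually reaches zero.
```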

Byzantine problems are a great way of expressing this. The way I used to explain it to students is that no matter how hard you try, and no matter how clever you are, you can beat the problem into a cowering mess in the corner, but you will never kill it. The sun might go cold before you have a failure, but you have never truly solved it.

This does lead to the converse point of view. If the probability of failure is of a similar order to that of all the other failure modes - such as random hardware failure (the probability of an undetected, uncorrectable memory error, etc.) - or the mean time between failures greatly exceeds the expected lifetime of the system, you can decide that the problem is usefully solved.

The “random” part of the unreliability tells me the answer is “no.” If you had known defects in certain components, you could feasibly work around them. E.g., modern CPUs have plenty of defects (errata): for example, issuing one particular instruction after another particular instruction returns incorrect data or causes the system to hang. Once the nature of the fault is known, an interim workaround is to make sure you never issue the problem instructions sequentially.

But if you don’t know or can’t predict what triggers the flaws, or what they are, then no, I don’t think you can use it to build a reliable system.

In the real world, though not in theory, it’s impossible to engineer a 100% reliable solution even out of reliable parts. You put it together and hope it doesn’t break down before the warranty expires. At best, stuff still wears out.

Given the freedom to put any sort of wrapper around the unreliable parts, the solution would seem to be to implement a reliable solution in the wrapper.

Actually, if you are able to write a reliable solution validator, then you can still leave the implementation to the unreliable stuff, though you might need to flip the coin an infinite number of times to wait for a correct result.
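That is essentially a Las Vegas-style construction: keep asking the unreliable implementation for an answer and only accept one that the (assumed reliable) validator approves of. A hedged sketch, with `unreliable_solve` and `validate` as hypothetical stand-ins:

```python
def solve_reliably(problem, unreliable_solve, validate):
    """Keep invoking an unreliable solver until the trusted validator accepts
    the result. The answer is correct whenever this returns, but the number
    of attempts is unbounded, which is exactly the 'infinite coin flips'
    catch mentioned above."""
    while True:
        try:
            candidate = unreliable_solve(problem)   # may fail outright or return garbage
        except Exception:
            continue                                # the unreliable part just failed; try again
        if validate(problem, candidate):            # the validator is the reliable part
            return candidate
```

All of the reliability burden simply moves into the validator, which echoes the earlier point: the problem can be moved around, but it doesn’t go away.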

I will agree that it’s impossible if the components are unpredictably unreliable.

I once took a control theory class where we studied how imprecise components could be used to build a more precise system if one understood the components’ behavior and that behavior was predictable. But this proposal starts with unpredictable components.

In theory, a single coin flip has a zero percent chance of coming up 50% heads - so it’s basically totally unreliable - however, over time the proportion of heads tends more and more towards 50%, and at infinite flips we have a party.
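A quick simulation of that convergence, for what it’s worth (this is just the law of large numbers, nothing specific to the reliability problem):

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
for flips in (10, 100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(flips))
    print(f"{flips:>9} flips: proportion of heads = {heads / flips:.4f}")
# The proportion drifts towards 0.5 as flips grow, but any finite run can still miss it.
```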

Do you have any understanding of what kind of reliability testing goes into electronic components? I do, and by the time they go into production the reliability is not a crap shoot. There are thermal cycling tests in various humidity conditions, along with die shear tests, die pull tests, ESD tests, ball shear, and a few others I’m forgetting. This is a real science. Why do you think cars can run for decades on electronics that survive -40C to +40C?

Error Correcting Codes solve a limited version of this problem.

The basic idea is that you have an unreliable data channel. Maybe it corrupts 10% of the bits; maybe 49%. Regardless, you can achieve arbitrarily high reliability through use of the right code. Arbitrary does not mean 100%; it just means (like leachim said) that you can have as many nines in your reliability figure as you want.
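The crudest example of such a code is plain repetition with majority voting. A sketch, assuming a binary symmetric channel that flips each bit independently with probability 0.1:

```python
import random

def noisy_channel(bits, flip_prob=0.1):
    """Binary symmetric channel: each bit is flipped independently."""
    return [b ^ (random.random() < flip_prob) for b in bits]

def encode_repetition(bits, n=5):
    return [b for b in bits for _ in range(n)]      # send each bit n times

def decode_repetition(bits, n=5):
    return [int(sum(bits[i:i + n]) > n // 2)        # majority vote per group of n
            for i in range(0, len(bits), n)]

message = [random.randint(0, 1) for _ in range(10_000)]
received = decode_repetition(noisy_channel(encode_repetition(message)))
errors = sum(a != b for a, b in zip(message, received))
print(f"residual bit error rate: {errors / len(message):.5f}")
# Raw channel: ~10% bit errors.  5x repetition: ~0.86% per-bit error
# (P(at least 3 of 5 flips) = 10*0.001*0.81 + 5*0.0001*0.9 + 0.00001).
```

Real codes (Hamming, Reed-Solomon, LDPC) buy the same effect with far less overhead, but the principle is the same: redundancy adds nines, never a guarantee.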

As much as we’d like to deny it, a huge chunk of computer science is statistical in nature. The entire security of the internet is based on it being hard to guess many-bit numbers. Given enough time, an attacker would break in, but in practice either the universe won’t last that long or the data will be useless by the time the code is broken. Likewise, error correcting codes may fail, but if the rest of the system is much more likely to fail, there is no problem.

One can actually achieve this arbitrarily high reliability without arbitrarily high overhead. One simple way of checking that your data is correct is via a checksum: a kind of merged version of the other bits in your message. A well-built checksum will give a false positive about as frequently as the inverse of its number of possible values. For instance, a 16-bit checksum has 65536 different values, and you can expect that it will falsely claim a corrupted message is correct about 1/65536 of the time.

Suppose you dedicate half your bits to the checksum value. Then, as your message size increases, your checksum reliability gets vastly better. Each doubling of the number of bits in your checksum doubles the number of nines in your reliability. A 16-bit checksum is 99.998% reliable; a 32-bit checksum is 99.99999998% reliable. It doesn’t take long to get to the point where, even if every atom in the universe were sending out messages at Planck-time speeds, you would have to wait longer than the age of the universe before you could expect a failure.
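You can sanity-check the 1/65536 figure empirically with any decent 16-bit checksum; here is a rough sketch using Python’s built-in CRC-16 (binascii.crc_hqx), corrupting messages at random and counting how often the checksum fails to notice:

```python
import binascii
import os

def crc16(data: bytes) -> int:
    return binascii.crc_hqx(data, 0)       # 16-bit CRC shipped with Python

trials, undetected = 1_000_000, 0
for _ in range(trials):
    msg = os.urandom(32)
    corrupted = os.urandom(32)              # stand-in for a randomly corrupted message
    if crc16(corrupted) == crc16(msg):      # checksum says "looks fine" anyway
        undetected += 1
print(f"undetected corruption rate ~= {undetected / trials:.6f}  (1/65536 ~= 0.000015)")
```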

Then why do they run for only decades rather than forever? Of course companies do everything they can to ensure the reliability of their components. They all can still fail. If nothing else, cosmic rays induce unpredictable failures. With an unlimited budget, it’s possible to reduce the probability of failure to any arbitrarily small NONZERO value.

–Mark

This only helps if the failures occur in your data storage (or transmission), not in the processor itself.

Yeah, like I said, ECC only solves a very limited version of the problem. Fortunately, logic gates tend to be more resistant to errors than SRAM (used in caches) and DRAM. So putting ECC on your cache and memory is generally a good improvement. If you have to deal with logic errors, you don’t have much choice but to run multiple units and have them vote on the right outcome, restarting any units that fail (spacecraft generally use this approach).
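A minimal sketch of that vote-and-restart pattern (triple modular redundancy); the callables and the restart policy here are just illustrative stand-ins, not how any particular spacecraft does it:

```python
from collections import Counter

def tmr_step(units, inputs, make_fresh_unit):
    """Run three redundant units, take the majority answer, and replace any
    unit whose answer lost the vote (it is presumed to have failed).
    Assumes the outputs are hashable, comparable values."""
    outputs = [unit(inputs) for unit in units]
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority; all three units disagree")
    units = [u if out == winner else make_fresh_unit()   # "restart" the dissenter
             for u, out in zip(units, outputs)]
    return winner, units
```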

All I can say is: try working with Unidata sometime.

Although this is a hypothetical/theoretical question, it’s based on a real-world problem I am trying to solve. The fundamental architecture of a system I am trying to support simply isn’t 100% reliable; any given programmatic action (say, launching a subroutine, or writing a value to a database file) can simply fail to happen.

In reality, it won’t be random - it will be driven by hidden factors - but in practice it might as well be random, because those factors are simple things like the happenstance of how many users are trying to do similar things at once. The software is itself engineered to try to control this, but because the control mechanism is written using the same fallible architecture, it sometimes also fails to prevent failure.
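For that kind of platform, the usual (imperfect) move is exactly the layering discussed above: perform the action, read it back to verify, and retry a bounded number of times. A hedged sketch, with write_value and read_value as hypothetical stand-ins for the real platform calls:

```python
def write_with_verification(key, value, write_value, read_value, attempts=5):
    """Attempt an unreliable write, read it back to check it 'took', and retry.
    Every step runs on the same fallible platform, so this only stacks on
    more nines; it cannot reach 100%."""
    for _ in range(attempts):
        try:
            write_value(key, value)           # may silently do nothing
        except Exception:
            continue
        try:
            if read_value(key) == value:      # verification is also fallible
                return True
        except Exception:
            continue
    return False                              # give up; the caller must handle this
```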

If you are limited to simply using more of the unreliable tools to create that wrapper though…