is it theoretically impossible to engineer a 100% reliable solution out of unreliable parts?

And here I thought you were talking about Java2k. :wink:

I certainly had it in mind when I started the thread, but I seem to have stumbled across an honest-to-god real-world implementation of something similar.

(it’s not 90% reliable in the case I have to deal with, it’s more like 99.8% under normal load, but the failure rate grows non-linearly as more people try to do more things at the same time)

try -40C to +85C, except for underhood where it’s -40C to +105C.

There is a way to weasel out of the problem by the way you define your system functions. If you want a calculating machine that “always gives the correct answer”, you need completely reliable parts. However, you can use unreliable parts to produce a calculating machine that, “if it gives an answer, always gives the correct answer”.

For many applications, this is enough. This works only with components that fail in reliable ways, which is why in high integrity design there is a demand for specialist components. Often the concern is not to make them super super reliable, but to make them highly reliable with predictable failure modes.
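A minimal sketch of that weaker guarantee, with an invented flaky adder standing in for the unreliable part (the 1% failure rate and the independence of the two runs are both assumptions):

    import random

    def unreliable_add(a, b):
        # Hypothetical flaky component: usually right, occasionally returns garbage.
        if random.random() < 0.01:
            return random.randint(-1000, 1000)
        return a + b

    def careful_add(a, b):
        # Fail-silent wrapper: run the calculation twice and only answer if the
        # runs agree. A wrong answer now needs two matching independent failures,
        # so it becomes rare - but "no answer" becomes a normal outcome.
        first = unreliable_add(a, b)
        second = unreliable_add(a, b)
        return first if first == second else None

It doesn’t make wrong answers impossible, it just trades most of them for refusals - which is the point of the weasel.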

If you’re dealing with something like software where “failure” is temporary, there is another weasel where you can trade reliable function for reliable timeliness. You can sometimes switch from “always gives the correct answer” to “eventually gives the correct answer”. (There’s a theorem that shows that you can’t always do this, but it isn’t a blanket prohibition on sometimes being able to do it). Similarly, you can trade reliability for accuracy, switching from “always gives the correct answer” to “always gives within a certain interval of the correct answer”.
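The “eventually gives the correct answer” trade looks something like this, assuming you already have a fail-silent function (one that returns nothing rather than a wrong answer):

    def eventually(compute, *args, max_attempts=None):
        # Trade timeliness for correctness: keep retrying a fail-silent function
        # until it produces an answer. With no attempt limit, termination is only
        # probabilistic - hence the theorem caveat above.
        attempts = 0
        while max_attempts is None or attempts < max_attempts:
            result = compute(*args)
            if result is not None:
                return result
            attempts += 1
        raise TimeoutError("no answer within the attempt budget")

Retrying only helps if the underlying function refuses rather than lies; retrying something that can return a plausible wrong answer just delivers the wrong answer sooner.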

None of these weasels work for constant functions. If your machine has to always be doing something, then unless your components are 100% reliable it will eventually stop (meaning that there is always a chance that it will stop at any particular time). You can extend the time, and make it fail in a predictable way, but you can’t prevent it eventually stopping.

Design it so that the smallest failure is always catastrophic? Yeah, I think that must be the approach we’ve implemented.

Actually, I’m sort of struggling to see how even that would be possible with universally unreliable components; if you have any kind of evaluation as to whether a function was successful, that evaluation process may err and give a false positive - and the function still tries to give an answer.

That’s not a good solution to the problem in the OP. When you learn about automata theory you learn about synchronizing sequences, which reach a certain state no matter what state you start in, but this assumes your state machine is reliable - which for the OP’s case it isn’t. (I learned this before 1974.) You can’t define a synchronizing sequence when you don’t know what the state transitions are.
The linked entry is about something that works correctly when starting in an unknown state.
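For anyone who hasn’t met synchronizing sequences, here’s a toy version with a made-up three-state machine (input “a” steps one state toward state 2 and then stays there). The word “aa” drives it to state 2 regardless of the starting state - which is exactly the property you lose once the transitions themselves are unreliable:

    # Toy DFA: states 0, 1, 2; on input "a", move one step toward state 2 and stay.
    TRANSITIONS = {0: {"a": 1}, 1: {"a": 2}, 2: {"a": 2}}

    def run(start, word):
        state = start
        for symbol in word:
            state = TRANSITIONS[state][symbol]
        return state

    # "aa" is a synchronizing sequence: every possible start state ends in state 2.
    assert {run(s, "aa") for s in TRANSITIONS} == {2}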

The answer, as others have said, is that you can get 100% reliability only with infinite levels of redundancy - with any finite amount of redundancy you only asymptotically approach 100%.
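The arithmetic behind the asymptote, assuming independent failures and a made-up part that works 99% of the time:

    # System works if at least one of n independent copies works.
    # With per-copy reliability p, system reliability = 1 - (1 - p) ** n.
    p = 0.99
    for n in (1, 2, 3, 5, 10):
        print(n, 1 - (1 - p) ** n)
    # 0.99, 0.9999, 0.999999, ... - creeps toward 1 but never reaches it,
    # and that ignores the redundancy management itself, which can also fail.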

About 25 years ago I saw a presentation by someone from Delco. He talked about life testing of their components. Though he didn’t say it, it was easy to figure out that the life testing was for exactly the time of the warranty. I assume they do better now.

Could you say more about how a programming language can be unreliable? Bugs may look like unreliable components, but that is only because there is an effectively random input data set. Once you find the sequence that sensitizes the bug, it will fail 100% of the time. Standard reliability is about the time to when a component changes behavior, going from working to non-working.
I’ve studied early life failures for processors, and published a few papers about it. It turns out that when you do burn-in, the first part of the bathtub curve does not come from traditional reliability problems, but from test escapes - bad parts which passed all the tests you gave them but fail the more extensive testing the customer does. This type of early life failure does not go away with burn-in.

I’ve never seen failures on the far end of the bathtub curve. First, the products are way out of warranty, but more importantly they’d be way obsolete. If one did happen the user would probably say “thank God - now I can buy something new.”

I have.

Hard drives that lasted 15 years, and then got “stiction” issues.
Electrolytics that failed after drying out.
CRTs that had their phosphors poisoned.

Oh, I’m not doubting they exist. People who work on space probes see them all the time, for sure. And I’m only talking about our products. My old Saturn was way on the other end.

The main benefit of reliability test for something like a smartphone is to reduce real early life failures and early wearout failures. Not many will be around for the various failure mechanisms to kick in.
But when I worked for Bell Labs our mindset was a lot different.

It happens.

The lifetime of an IC is very sensitive to voltage and temperature. A 10% difference in voltage may mean the difference between a chip that lasts 20 years vs. one that lasts 5 years. That may seem a reasonable price to pay, but 10% can mean the difference between a viable product and a complete flop. So in practice, the chip is run right at the ragged edge of its expected useful lifetime.

Most of the time, this is fine–most customers upgrade before that point. But a few customers hold onto their products for a bit longer than usual, and some might live in a warm climate, and some will just get unlucky and have a chip on the bottom end of the lifetime range. So not everyone is going to be happy, and the chip lifetime is selected to be a balance between performance and not having too many unhappy customers.
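For the temperature side of that, the usual back-of-the-envelope is an Arrhenius acceleration factor - a generic first-order reliability model, not anything specific to the chips being discussed, and the 0.7 eV activation energy below is just an assumed typical value (voltage acceleration is modelled separately):

    import math

    BOLTZMANN_EV = 8.617e-5  # eV/K

    def arrhenius_acceleration(t_use_c, t_hot_c, ea_ev=0.7):
        # How much faster a temperature-driven wearout mechanism runs at a
        # hotter junction temperature. ea_ev is mechanism-dependent (assumed).
        t_use = t_use_c + 273.15
        t_hot = t_hot_c + 273.15
        return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_hot))

    print(arrhenius_acceleration(55, 75))  # ~4x faster wearout for running 20 C hotter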

I’ve recently been refurbishing a Meade telescope from ~1995. It failed due to some tantalum caps. They were specced at about 20% over operating voltage, and as the years passed they degraded ever so slowly until the margin was too low and they failed catastrophically. Nothing obsolete about the scope, and the parts were fine… until they weren’t.

Failure appears to be an emergent property of broad/intensive use with the system I have in mind. All of the individual parts actually test out OK; any end-to-end test of a combination of parts, when performed alone in a test environment, will work. When 150 users are all trying to do different things, it fails all over the place at random. Retry the exact same thing that failed once, and it works next time, probably.

It’s not a capacity issue in the machinery - all the dials are in green - it’s most likely a collection of memory leak/overflow type issues or badly-scoped/poorly-controlled limits on something like file or record handles somewhere deep inside the database engine.

You’re assuming some sort of checker function, where the output of the main function goes through an evaluation function, and gets blocked if incorrect. As you suggest, that doesn’t work. You need a design where failure of the function itself leads to no answer instead of an incorrect answer. That’s not that hard in hardware, but from your answers to other posts I’m assuming that when you say “unreliable” you also mean non-deterministic, at least from the point of view of the designers. You don’t have insight into the failure mechanisms, so it’s quite "demon in a box"y.

In this case you could only use the weasel solutions I was talking about if you somehow managed to build subsystems out of your unpredictable components such that the subsystems failed predictably. They wouldn’t need to be reliable, just have limited failure modes and predictable interactions. Then you could build functions out of them to make the tradeoffs.
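One classic shape for that is majority voting - assuming you can get replicas whose failures are independent and whose only failure mode is “wrong value”, both of which are big assumptions, and remembering that the voter is itself one more part that can fail:

    from collections import Counter

    def vote(*results):
        # Majority voter (TMR-style): return the value most replicas agree on,
        # or None when there is no majority - a limited, predictable failure mode.
        value, count = Counter(results).most_common(1)[0]
        return value if count * 2 > len(results) else None

    print(vote(7, 7, 7))    # 7
    print(vote(7, 7, 42))   # 7 - one bad replica masked
    print(vote(7, 42, 99))  # None - refuse to answer rather than guess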

Despite the fact that there are no parts that are 100% reliable we still have plenty of solutions considered reliable enough to implement.

Sure - this thread was more along the lines of whether there was some statistical methodology unknown to me that could compensate for this (although I didn’t expect that there would be).

Non-deterministic is the key here.
Even perfectly reliable hardware is subject to non-deterministic behaviour without extreme effort. In reality it is extraordinarily hard to work around this. (My PhD was concerned with an allied aspect of this, and it is grim.)

Just something as simple as this illustrates the point.

> cc -o hello hello_world.c
> ./hello
killed
>

You can’t write a single program to cope with that. And even modern OS’s don’t.

A while ago an old colleague was having a rant about pet hates in programming languages, and spouting about exceptions (i.e. throw/catch) and how bad they were. Writes he: “a failure is simply the result of a badly written routine or library, they should all be able to be asked if they will succeed ahead of time.” You can’t do it. Even in a hard real-time environment the kernel is (or should be) designed to degrade as deadlines fail to be met. Temporal non-determinism is present in any real world computer system, and whilst, in isolation, a program, or function, might be provably correct, it doesn’t ever get executed in isolation, and you will never know for sure that it will always execute correctly. Defensive programming and sensible programming languages and environments can help a great deal, and there is a good chance you can have the system trap the vast majority of failures, but you won’t ever get them all.
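A concrete reason the “ask it whether it will succeed first” idea can’t work: the answer goes stale between the asking and the doing. The file name here is invented; the race is real on any multi-tasking system.

    import os

    PATH = "/tmp/some_file_that_might_vanish"  # hypothetical path

    # Look-before-you-leap: the check can pass and the open can still fail,
    # because another process can delete the file in between.
    if os.path.exists(PATH):
        with open(PATH) as f:   # can still raise FileNotFoundError
            data = f.read()

    # Ask-forgiveness: just try it, and treat the failure as a normal outcome.
    try:
        with open(PATH) as f:
            data = f.read()
    except OSError:
        data = None             # handle it; you could not have predicted it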

ACID transactions are an abstraction to help cope, but even then, you can never be sure some fine timing issue won’t break the guarantees. But if you code with something like them as your core abstraction you can at least reason usefully.
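A sketch of coding with the transaction as the core abstraction - SQLite only because it’s ubiquitous, and the table, column names and retry limit are all invented. The transfer either happens entirely or not at all, and contention is handled by retrying the whole unit:

    import sqlite3
    import time

    def transfer(db_path, src, dst, amount, retries=5):
        # Move 'amount' between two accounts as one atomic transaction.
        # Either both UPDATEs commit or neither does; contention is retried.
        for attempt in range(retries):
            conn = sqlite3.connect(db_path)
            try:
                with conn:  # BEGIN ... COMMIT, or ROLLBACK on exception
                    conn.execute(
                        "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                        (amount, src))
                    conn.execute(
                        "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                        (amount, dst))
                return True
            except sqlite3.OperationalError:  # e.g. 'database is locked'
                time.sleep(0.1 * (attempt + 1))
            finally:
                conn.close()
        return False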

I understand. I think the approach is to consider the level of reliability that is acceptable. When it comes to engineering there is a limit to what can be done to compensate for unreliability because adding more components tends to decrease reliability.

Same thing here actually, in a weird sort of way - because the problem scales with load on the system, adding more checking processes increases load, which makes the problem worse at the same time as trying to fix it.
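One mitigation that doesn’t add a standing checking process: make the retries themselves back off and spread out, so the recovery traffic doesn’t pile onto the same load spike that caused the failure. A sketch, assuming the failing call raises an exception you can catch:

    import random
    import time

    def call_with_backoff(operation, max_attempts=6):
        # Retry a load-sensitive operation with exponential backoff plus jitter,
        # so many clients retrying at once don't hammer the system in lockstep.
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                delay = 0.1 * (2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
                time.sleep(delay + random.uniform(0, delay))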

First, give up on 100% reliability, because it doesn’t exist. Next, figure out the reliability rate you do need. Third, measure the reliability of the components you have.

Those are the easy first steps. If you don’t know how to do them, then you need to hire someone who does.
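The arithmetic behind those steps is short. With invented numbers, and assuming independent failures in a chain where every part must work:

    # A chain of parts that must all work: system reliability = r ** n.
    n_parts = 20
    r_part = 0.998                       # measured per-part reliability (assumed)
    print(round(r_part ** n_parts, 4))   # ~0.9608 - well short of a 0.999 target

    # Working backwards: the per-part reliability needed to hit the target.
    target = 0.999
    print(round(target ** (1 / n_parts), 5))   # ~0.99995 per part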

You also have to refine your understanding of what reliability means for different operations. For example, when we say “reliable communications” we don’t mean that communication always works. We actually mean:

  • We are notified if a message might not have made it

  • Delivered data is guaranteed accurate to a known probability of failure (covered above under ECC)

For other operations the definitions can be different.
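A toy version of that communications definition, with everything here invented for illustration (send and wait_for_ack stand in for whatever transport you actually have):

    import hashlib

    def frame(payload: bytes) -> bytes:
        # Prefix a checksum so the receiver can detect (most) corruption.
        return hashlib.sha256(payload).digest()[:4] + payload

    def send_reliably(send, wait_for_ack, payload, retries=3):
        # "Reliable" in the sense above: we never silently lose a message.
        # We either see an acknowledgement or report that it may not have made it.
        msg = frame(payload)
        for _ in range(retries):
            send(msg)
            if wait_for_ack(timeout=1.0):
                return True
        raise RuntimeError("message may not have been delivered")  # the notification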