Single point of failure

More about failure:

In the software world, at least, it’s often the case that we want software to fail early: if something goes a bit wrong, die then and there, while all of the data is still fresh and a good postmortem can be done, and, more importantly, die before the error can cascade. A classic example is malloc() failing in C code: malloc() allocates memory, and if it fails you don’t have the memory you think you have. Unless you check for failure and fail early, the inevitable crash may occur a long way from the initial failure, and the passage of time is likely to have destroyed all of the state needed to debug it. Worse, the crash might have caused other, random damage along the way.
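
To make that concrete, here is a minimal sketch of the fail-early idiom in C. The xmalloc() wrapper is just a common convention I’m using for illustration, not anything from the standard library:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Die immediately if malloc() fails, while the requested size is
       still known and a core dump would still be useful. */
    static void *xmalloc(size_t size)
    {
        void *p = malloc(size);
        if (p == NULL) {
            fprintf(stderr, "malloc(%zu) failed, aborting\n", size);
            abort();  /* fail early, before the bad pointer can cascade */
        }
        return p;
    }

    int main(void)
    {
        char *buf = xmalloc(64);  /* crash here, not 10,000 lines later */
        strcpy(buf, "postmortem-friendly");
        puts(buf);
        free(buf);
        return 0;
    }

The point of abort() rather than a polite exit is that it leaves the crime scene intact: the failure happens at the spot where the memory was requested, not wherever the NULL pointer would eventually have been dereferenced.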

This seems like an intentional SPOF and it is, indeed, a Good Thing in many kinds of code.

At the risk of starting a debate, I understand where the OP may be getting the first definition. I’ve heard discussions on ways to decrease the cost of testing safety-critical or mission-critical systems. One theory is that you focus your testing on the failures that are most probable; that way, you get the most bang for your testing dollar. I can see how this theory could be stretched and applied to system designs where the most probable component failures would be placed into (or out of) a critical path so that the intended system behavior was achieved. In other words, a single-point failure of the component most likely to fail would cause the system to fail safe.

Personally, I don’t buy it. I would rather view failures in terms of their consequences. I don’t care if a component’s failure rate is less than 10^-6; if the consequence of the component’s failure is the loss of a life, then there had better be some redundancy for that component in the system.

You can’t design out all SPOFs. All you can do is reduce their number and likelihood below some threshold, for some price. For example, airplanes don’t carry spare wings. If a wing fails, the plane (and you) are screwed.

Which brings up the concept of scale. As noted above, the right wing of an airplane is not redundant.

But if you look more closely, the innards of the wing include a lot of redundant structure, such that any one of several different pieces of metal could fail without causing the wing to fail. And if you look more closely yet, an individual piece of metal is made up of crystals, and the piece is sized larger than its expected load requires, such that one crystal can fail without the piece failing.

The same thing applies to software / IT systems. Folks talk about multiple servers, dual power supplies, etc. Ask them what happens when the building is flattened by a meteor. A whole different set of design criteria (and costs) are needed to eliminate SPOF at that scale.

Ultimately, whatever we build has a SPOF in that when a Doomsday meteor trashes the entire planet the system will fail. The good news is the requirement for uptime goes away at the same time.

Said another way, in the real world, almost all systems have fuzzy boundaries and are prone to SPOF-like degradation when things just beyond their boundaries fail.

Finally, I’ve never heard of any concept of SPOF in any discipline that corresponds to what I think the OP said. In every discipline with which I am familiar, it’s as others have noted: a point within a given system boundary where a failure will result in either degraded operation or catastrophic failure of the system. Which threshold to use depends on the issue at hand.

Perhaps the OP can provide a few citations that use his first definition? (I’ve also never seen it.)

I’ve had a few too many drinks to fully articulate what I’m trying to say, but it appears that the OP’s first example is describing a designated point of failure: when a component is stressed beyond its operational parameters, it fails at a predetermined point in order to contain or reduce the consequences.

Crumple zones in cars or shear pins in heavy equipment for example.

Not really a single point of failure, but an example of designing where the failure will occur as a feature rather than a flaw.

That’s not bad, even for the alleged drinks.

Cosmic Relief, is this what you meant? If not, what is different?

I’ve also heard the phrase “single point of failure” used in regard to rock climbing, where it is considered a Very Bad Thing™. In my training, the teachers emphasized over and over that you should never design or use any equipment or protection that includes a single point of failure. When you run the rope through your harness, you run it through 2 pieces of your harness. If you attach a top rope, always use 2 of everything. And on and on.

Because if you have a single point of failure, and that piece fails, you die. Or, in morbidly descriptive rock climbing parlance, you crater.

J.

Sheesh, talk about penny-wise and pound-foolish. 15 drives wouldn’t cost much more than 14 drives, and would provide enough redundancy that you could recover from any single drive crashing (provided that you knew which one). Now, to be really comfortable, I’d want yet more redundancy than that, but even in the most pessimistic case you need a lot less than double the investment.
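
For the curious, the trick that lets one extra drive cover the loss of any single known drive is simple parity; I’m assuming plain XOR parity here, which is the usual way one drive’s worth of extra space buys you that. A toy sketch in C, with 14 made-up one-byte “drives”:

    #include <stdio.h>

    #define DATA_DRIVES 14

    int main(void)
    {
        /* Toy model: each "drive" holds a single byte of data. */
        unsigned char drive[DATA_DRIVES] = {
            3, 14, 15, 92, 65, 35, 89, 79, 32, 38, 46, 26, 43, 38
        };

        /* The 15th drive holds the XOR of the other 14. */
        unsigned char parity = 0;
        for (int i = 0; i < DATA_DRIVES; i++)
            parity ^= drive[i];

        /* Say drive 5 crashes, and we know which one it was. */
        int dead = 5;
        unsigned char lost = drive[dead];

        /* Rebuild it: XOR the parity with every surviving drive. */
        unsigned char rebuilt = parity;
        for (int i = 0; i < DATA_DRIVES; i++)
            if (i != dead)
                rebuilt ^= drive[i];

        printf("lost %d, rebuilt %d\n", lost, rebuilt);  /* identical */
        return 0;
    }

The catch is exactly the one noted above: with a single parity drive you can only rebuild if you already know which drive died. If you don’t, parity can tell you something is wrong but not what, which is part of why you’d want more redundancy than this to be really comfortable.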

Or those weak points in a thistle’s root, such that if you try to pull the weed up, it breaks off a half-inch below the ground and grows back a couple of days later. Not that my gardening days have made me at all bitter about thistles, mind you.

I’ve been in the computer industry for over 30 years and I, too, have never heard it used in this way. A SPOF is any component whose failure brings a system or network down.

There would be no point in a system which was deliberately designed to conform to your meaning; you either build a minimum system that will end up having multiple SPOFs (such as a standard PC), or you design a system to have no SPOFs. Why would you design a system with precisely one SPOF? You may as well eliminate it too.