I’m a little confused about the correct use of the term “single point of failure”.
My understanding of SPOF is a system where a maximum of one component can cause a catastrophic failure. This is a good thing, because it minimizes the probability of a catastrophic failure. SPOF is a characteristic of a system, not a component.
In the computer business, people universally use SPOF as a term to describe any component that can bring the entire system down. To them, a SPOF is a critical component that is not redundant. This definition is apparently so entrenched that I was unable to google up anything else.
I think the computer people have it wrong. And ordinarily I couldn’t give a flip about terminology wars, except I believe SPOF has a specific and unambiguous meaning that is being lost through misuse.
I’m a computer software guy, and I’ve never heard your first definition. In fact, I really don’t understand what that definition is supposed to mean – if a single component can cause catastrophic failure, that would seem to indicate that the probability of such failure is high.
To me, a Single Point of Failure refers to a particular point – the component or components which have no redundancy – and not the system as a whole.
In computer systems terms, the first definition doesn’t make any sense. An application that has even one component that could cause a catastrophic outage has one too many. An application should have (or at least be designed to have) no non-redundant components.
So I don’t know about the etymology, it could be that you’re right, but I’m having a hard time imagining any type of engineering endeavor where having even one non-redundant point is ok.
I’ve never viewed SPOF as a characteristic of a system.
Logically, to me, it breaks down something like this:
point-of-failure: any point at which the system can fail. For the system to fail, it may require a combination of other failures.
single-point-of-failure: any point at which the system can fail, just by itself. However, this doesn’t imply that it is the only one. In a system, there can be multiple single-points-of-failure.
singular-single-point-of-failure: only one point in the system at which the system can fail, just by itself. There may be other points of failure in the system, but those failures require multiple concurrent failures in order for the entire system to fail.
But who knows, my interpretation could be completely wrong.
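If it helps, here’s a rough sketch (Python, with a made-up toy system) of how that breakdown would look in practice:

```python
# Toy system, purely illustrative: it fails if the lone database dies, if the
# lone load balancer dies, or if BOTH web servers die at the same time.
COMPONENTS = ["db", "lb", "web1", "web2"]

def system_fails(failed):
    """Does the system as a whole fail, given this set of failed components?"""
    failed = set(failed)
    return "db" in failed or "lb" in failed or {"web1", "web2"} <= failed

# A single-point-of-failure is any component that takes the system down all by itself.
spofs = [c for c in COMPONENTS if system_fails({c})]
print(spofs)                            # ['db', 'lb'] -- two SPOFs, so no *singular* SPOF here
print(system_fails({"web1"}))           # False: one web server alone is just a point of failure
print(system_fails({"web1", "web2"}))   # True: it takes both web servers failing together
```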
Would it be fair to suggest that in engineering (or whatever), they feel it’s a good idea to have only one SPOF, whereas to computer folks, one SPOF is one too many?
I guess it depends on what SPOF really means. Under my own (maybe misguided) assumption, engineering would ideally want to have zero SPOFs, and only POFs.
I also am not quite sure what you mean by the first sense; I’ve only heard it used in the second sense of a non-redundant component, where it’s never a desirable quality.
Is your meaning talking about things like sacrificial components (fuses, sacrificial gears, weak links), designed to fail cheaply before the expensive parts of the system break? That’s the only thing I can think of from your description.
That is what I would go with as well. A single point failure is where a single component in a system can cause the system as a whole to fail to function. A definition of a single point failure does not define the probability of a failure occurring. Other failure modes exist.
Regardless of whether it’s computer software or any other form of engineering, single point failures are always unwelcome and should be designed out. When they cannot be, we typically resort to derating the load rating of the system, enhanced maintenance and replacement cycles, etc.
I think the difference between engineering and computer systems is that a failed computer system won’t cause a chain reaction to become a much larger problem, whereas a “break-away” object (like a fuse) can save millions of dollars. Single points of failure in a mechanical/electrical device (although I’ve never heard them termed as such) are a very good thing for that very reason. But, single points of failure in computer system means that there isn’t redundancy (and most important resiliency), thus when the system goes down, there is no automated means of recovery. There is no upside to SPOFs in a networking/computer environment.
Avoiding SPOFs in network design not only provides for immediate failover, but it gives us the option of configuring/patching/maintaining equipment without any downtime.
Your first definition makes no sense. How can anyone guarantee that there’s only one way to make the system fail?
The second definition is the correct one. It is a single point of the system, where if that point fails the whole system fails. And the system can have multiple single points of failure.
And as was pointed out, the existence of failure points says nothing about how likely those failures are. Just that if they occur, the whole system crashes.
If you can tolerate periodic system crashes, then single points of failure are tolerable. But if a system crash would be a catastrophe, then the existence of single points of failure is intolerable, because eventually that failure is going to occur, and then you’ll be screwed.
Typical examples would be a database system that depends on data stored on one hard drive. If that hard drive breaks, then your entire database is lost. But everyone knows that hard drives fail periodically, so any database that relies on one hard drive functioning perfectly forever is a system designed to fail. Which is why splitting databases onto multiple servers with multiple redundant hard drives is the universal real world practice. When one hard drive in your array goes bad, you shrug your shoulders, unplug it, and plug in a new one.
I remember one test lab I worked at where a multiple-drive array was created with no redundancy. With 14 drives in the array, this meant that if any of those 14 drives failed, the entire array would fail. In that case multiple drives increased the chance of failure, since each drive was a single point of failure. And of course, the reason it was set up this way was to get maximum storage space out of older low-capacity drives…but of course since the drives were older, the chance of getting a drive failure on any of them was high. Now guess what happened when we needed a full week of continuous stress/perf testing, and the scheduled release date was only 10 days away?
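For what it’s worth, the back-of-the-envelope math on that kind of array looks like this (the per-drive failure chance is just a number I made up for illustration):

```python
# Back-of-the-envelope numbers for that lab setup. The 5% per-drive failure
# chance over the test window is invented, just to show the shape of the math.
p_drive = 0.05   # assumed chance one (old) drive dies during the run
n_drives = 14

# Non-redundant stripe: any one drive failing loses the whole array.
p_stripe = 1 - (1 - p_drive) ** n_drives
print(f"14-drive stripe: {p_stripe:.1%} chance of losing everything")   # ~51.2%

# Mirrored pair for comparison: both copies have to die.
p_mirror = p_drive ** 2
print(f"mirrored pair:   {p_mirror:.2%} chance of losing everything")   # 0.25%
```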
It’s widely known that when the Challenger exploded in 1986, one of the engineers at Morton Thiokol instantly suspected the O-ring had failed. They’d been talking about it that very day as a failure point, the failure point.
Well yes, makes sense. But it all relates to a point raised by others: how could you be sure that that one point was the only possible POF?
So maybe OP is referring to something like a fuse, also mentioned above, which is designed to prevent other failures from becoming catastrophic. After all, the only function a fuse serves is to fail if something else goes wrong.
OK… unsurprisingly I have a bunch of computer people telling me that the first definition makes no sense whatsoever to them. This confirms to me that the second definition is universally held among computer people, but I’m a little surprised that you guys can’t even try to wrap your head around the first definition (what I think is the classical engineering definition).
If I recall my probability correctly… let’s say a certain class of module has a 1/100 chance of failing on a given run. (It’s a piece of junk.) Let’s say a given system contains 4 of these modules in a non-redundant configuration. The chance of any module failing is 4/100, if I recall my probability math correctly. Multiple points of failure are worse than a single point of failure. I think we can all agree on that. However many failure points you have, you try to make them redundant. I think we agree on that as well.
What I think is that even if you make a particular failure point redundant, you still haven’t eliminated the point of failure. (Both halves of your cluster could fail, for example). You have only reduced the probability that a failure will occur at that point. There is no such thing as a zero point of failure system.
[Nitpick] Approximately right answer, wrong method: the probability of at least one failure is one minus the probability that none fail, i.e. 1 - (0.99^4), or 0.0394. Multiply by 100 for a percentage (3.94%). [/Nitpick]
But in your example, will the failure of any one of those components cause the whole system to fail? If so, then each of those components is a single point of failure.
If the probability that a component will fail is 1 in 100, then the probability that it won’t fail is .99. The probability that no components would fail would then be .99^4, or .9606; therefore the probability that the system would fail on any given run is 0.0394, or about 1 in 25.
But if your system will continue to work as long as one component is still working, then the chance of failure per run is .01^4, or 0.00000001, or 1 in 100 million.
The first system is hideously unreliable. The second system is highly reliable, even though both systems use the same fallible components. The second system doesn’t have a “single point of failure”.
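To make the contrast concrete, here are those same numbers run out (a quick sketch, same 1-in-100 modules as above):

```python
# Same four 1-in-100 modules, two arrangements (quick sketch).
p = 0.01   # chance a single module fails on a given run
n = 4

# Non-redundant (series): any one module failing takes the system down,
# so every module is a single point of failure.
p_series = 1 - (1 - p) ** n      # 1 - 0.99**4 = 0.0394..., about 1 in 25

# Fully redundant (parallel): the system fails only if all four fail.
p_parallel = p ** n              # 0.01**4 = 1e-8, 1 in 100 million

print(f"series (every module a SPOF): {p_series:.4f}")
print(f"parallel (no SPOF):           {p_parallel:.8f}")
```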
If your definition was ever the classical engineering definition, it has been completely subverted by now. SPOF is used frequently when discussing avionics failures after airline crashes. If you read any of the material published after the Challenger failure, you’ll find the documentation riddled with references to SPOF in the “computer” sense, and I’ve never seen it in the one you are talking about.
An example, from The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA:
(emphasis mine)
This is a perfect example: there are actually four of these single points of failure in the system, but any one of them may fail and take out the whole thing.
The “engineering definition” to this engineer sounds like pure hooey. If you can eliminate all of the other potential sources of failure, why wouldn’t you eliminate the so-called single point of failure? Engineered systems (including software systems) are complex, and there are multiple points of failure. It’s not “only one, single point-of-failure” but rather “an individual point at which failure can occur.” The whole idea is that various things don’t have to go wrong in some unlucky combination of circumstances to bring a system down, but rather that a single component of the system can bring the entire system down. It’s generally not a good thing to have a single point of failure, but in many cases, multiple single points of failure just can’t be eliminated.
I’m a mechanical engineer working in aerospace (rocket boosters and support systems); our definition for single point of failure–used in reliability and FMECA (Failure Modes and Effects Criticality Analysis)–is definitely the second. SPFs are to be avoided whenever possible; even if the incidence (or occurrence)–one of three general categories used to rate the risk level–is low, the severity of failure is high, because it means an instant failure of a subsystem and potentially the entire system in some mode. (The third category is detectability or inspection, i.e. your ability to detect or identify a failure before it becomes critical to operation.) SPOFs are always high risk, and are only mitigated by a combination of high design margins, proof and acceptance testing, and inspection. Although you would like to get rid of any single point failures in a system–especially one like an aircraft or rocket booster where system failure means mission failure, loss of payload, and potential hazard–it simply isn’t possible to design a complex system without some kind of SPF, as Balthizar notes.
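To give a feel for how those three categories combine, here’s a crude FMEA-style “risk priority number” sketch–not how an actual FMECA worksheet is scored, and the modes and ratings are invented:

```python
# Crude FMEA-style "risk priority number" sketch: rate each failure mode 1-10
# on severity, occurrence, and detectability, then multiply. The modes and
# ratings below are invented; a real FMECA assigns criticality its own way.
failure_modes = {
    #  failure mode:            (severity, occurrence, detectability)
    "single point failure":     (10, 2, 8),  # rare, but severe and hard to catch in service
    "sensor drift":             (4, 6, 3),
    "one redundant pump fails": (5, 5, 2),
}

for name, (sev, occ, det) in failure_modes.items():
    rpn = sev * occ * det
    print(f"{name:26s} RPN = {rpn}")
# The single point failure comes out on top even with a low occurrence rating,
# which is why SPFs get the extra margin, testing, and inspection.
```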
I have to admit that I don’t really understand the OP’s first definition: “…a maximum of one component can cause a catastrophic failure. This is a good thing, because it minimizes the probability of a catastrophic failure…” I think he means that it is a system in which only one component can fail (???), or that the failure of one component in the system does not impact the operation of other components in the system, either because all components are redundant or because the function of an individual component is not critical to the rest of the system. In reliability engineering terms this would be called a “fully redundant system” (or dual redundant, triple redundant, et cetera). In rocket boosters, flight termination systems (ordnance designed to stop the motor from thrusting and/or break it into pieces that pose a minimal hazard to people and structures) are always at least dual redundant; that is, they have two separate antennas, triggering systems, ordnance lines, et cetera, so that the failure of any component in one system doesn’t impact the reliability of the other system.
I’ll also second the request for clarification of the OP, and agree with questioning the assumption that you can design a system to completely eliminate failure.
Even introducing redundancy does not eliminate possible failure; it usually can only shift the point of failure to a more reliable component. For example, having redundant power supplies will help reduce failure, since power supplies generally fail more often than other components. However, you need to have a control system to manage the redundant supplies. If designed correctly, this would reduce the risk, but it can’t eliminate it.
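A rough sketch of that shift, with made-up numbers:

```python
# Rough sketch: redundancy reduces the risk but moves it, it doesn't erase it.
# All probabilities are invented, for illustration only.
p_psu = 0.05          # chance a single power supply fails over some period
p_controller = 0.005  # chance the failover/control logic itself fails

# One power supply: the PSU is a single point of failure.
p_single = p_psu

# Two power supplies behind a controller: the system fails if BOTH supplies
# fail, or if the controller does -- the controller becomes the new SPOF.
p_redundant = 1 - (1 - p_psu ** 2) * (1 - p_controller)

print(f"single PSU:            {p_single:.3f}")     # 0.050
print(f"dual PSU + controller: {p_redundant:.4f}")  # ~0.0075 -- better, but not zero
```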