While this statement isn’t wrong, the real issue is more complex than just, “The management knew it could fail but flew anyway.” There were clear indications of blowby (propellant gases leaking through the field joint and partially eroding the inner o-ring) but never a complete failure of the o-ring with constant jetting of gas. This was interpreted by both NASA and Morton-Thiokol management as providing x factor of safety, where the factor of safety was calculated as the ratio of the pristine thickness to the maximum eroded thickness. While it is not good engineering practice to assume that an unexpected operating condition provides margin based upon the fact that it hasn’t failed yet, the reality is that such conditions frequently occur, especially with components under dynamic or erosive loads, and their relative integrity is often assessed empirically using “engineering judgment” (i.e. intuition) rather than statistical analysis (which tells you that there is a non-trivial chance of failure for everything).
O-ring blowby, by the way, is frequently seen at field joints and end caps of solid propellant motors, and as long as it is not excessive (e.g. degrading the mechanical integrity of the joint or causing a pressure loss in the chamber) is accepted as a normal operating condition. The reason this was a problem for Challenger is that the location and orientation of the leak caused it to jet into the thrust support and through the hydrogen tank (although frankly even if it had not ruptured the tank the unbalanced thrust and aerodynamic shear forces would have broken up the Orbiter regardless). It was known that the o-ring material was not resilient at low temperatures, generally resulting in a higher incidence of blowby at colder ambient temperatures at launch (though the data was not consistent; one of the most severe observed leaks occurred at the highest ambient temperature). While it is frequently reported that the low ambient temperature was the proximate cause of failure, there were other significant contributory factors that are not commonly mentioned, such as the ground winds flowing in a direction resulting in the pooling of vented liquid oxygen in the specific region of the failed joint, and the very high wind shear (higher than previous STS experience and exceeding 97% of the modern GRAM-99 statistical wind loading profile) experienced during ascent of STS-51-L, which caused unloading (‘rotation’) of the joint and exacerbated the leakage and subsequent erosion. Without these additional conditions, it is likely that the joint would have experienced only the intermittent blowby seen on other flights and no catastrophic failure would have occurred.
The NASA and Thiokol engineers who argued against the launch did so on the basis that the SRMs were not qualified to operate at the ambient temperatures of the STS-51-L launch, but the specific chained failure mode (blowby causing structural failure of the strut, et cetera) was not previously examined, and it took several weeks for the initial investigation to drill down to leakage of the field joint as the root cause of the failure. It is true that the o-ring, as specified, was being used in an out-of-design condition, and that the design of the joint was flawed, allowing it to gap open under external shear loads instead of being pushed closed. The potential flaw in the design was known and Thiokol had proposed implementing a correction (which was implemented on the advanced fiber-wound composite motors for the Air Force “Blue Shuttle” program) but given the cost of replacing the reusable motor cases and the lack of evidence of failures there was no real impetus to implement a Class 1 change and requalify the motor as a new design with all of the attendant costs.
So the result isn’t just that there was an obvious failure mode which was ignored by management (though there were clearly indications of an out-of-design condition which could potentially cause other unanticipated effects) but that there was a confluence of conditions which resulted in an unanticipated catastrophic failure mode. This is what is known as a system failure; the individual components functioned within accepted functional parameters, but the system as a whole failed because of the influences of one part of the system on another. Note that this is hardly an isolated case on the Shuttle program; there was a wide array of known potential subsystem failures with non-negligible incidence which could result in loss of mission and crew and which were mitigated only by monitoring, i.e. looking for indications of incipient failure, which is sort of like a blind man locating a cliff by tap-dancing at the edge. The complexity and low design margins of the Space Transportation System guarantee complex failures which are difficult to design and process away. The sheer quantity of risks on every flight makes it difficult to separate the wheat from the chaff and highlight the most critical risks to be addressed from nominal baseline risks. For instance, the impingement of frozen insulation on the leading edges of the delicate RCC wing panels, which caused the failure during reentry of STS-107, was a well-known problem that was evident from the beginning of the STS program (and noted on the first four test flights as a major risk) but took over a hundred flights before it manifested as the root cause of a catastrophic failure. The predictive tools used by NASA and USA indicated that the probability of occurrence was low given the number of successful flights, but in reality, every flight was one step away from failure, and frankly, there was no practical way to implement a fix without radical redesign of the STS system.
The real problem with the NASA culture, both then and now, is being both risk-averse and risk-obtuse, i.e. management doesn’t want to accept risks as a consequence of operating a complex system, but also doesn’t want to engage in thorough analysis of risks for fear that the result will be an unmitigatable excessive risk that will shut down the system, which, frankly, is exactly what ended the STS program, so that institutional fear was entirely well-founded. The lessons learned from the loss of Challenger aren’t as simple as “Don’t design something that can fail”, “Never fly a vehicle that has any active risks”, or “Always listen to the engineers,” (who are perpetual worry warts) but rather a more complex set of lessons, such as the following:
[ul]
[li]Rocket launch vehicles are inherently complex systems with many unavoidable single point failures (SPF) and unanticipated chain failure modes;[/li]
[li]The STS, with multiple independent propulsion systems, parallel staging, required gliding cross range, and low design margins dictated by high performance requirements, was especially and overambitiously complex for the capability it provided;[/li]
[li]Success in prior flights does not mean that all possible failure modes have been exercised, i.e. there is no such thing as “flight qualified”, as no flight will experience the extremes of loads and environments with the margin required to have significant confidence in the reliability over the range of probable operating conditions, and you may be at the edge of the cliff on every flight without knowing it;[/li]
[li]Performing thorough component level qualification testing and operating within the qualified baseline is crucial to reliability at the component/subsystem level, but system failures can still occur;[/li]
[li]Out-of-design conditions (like o-ring blowby) should be addressed in a failure modes and effects criticality analysis which considers the system-level effects, even for systems with extensive flight experience;[/li]
[li]Just having a process, and especially monitoring-type processes, will not save you; complex systems require consistent critical analysis and independent evaluation;[/li]
[li]When it comes to very high value payloads (e.g. crewed missions) there should be practical abort/recovery modes during every stage of the mission; the STS had no abort modes between SRB ignition and SRB separation, which meant any significant failure in the SRB system was essentially guaranteed to result in catastrophic loss of crew and vehicle.[/li]
[/ul]
Because of the high consequence of failure of rocket systems (i.e. a failure in the propulsion system, guidance system, or structure is inherently catastrophic) practical demonstrated reliability is always lower than would be acceptable for any terrestrial pursuit. The first-level Bayesian prediction of the mean probability of at least one failure for a new type of launch vehicle within the first five flights is more than 80%, and even after ten successful flights and no failures there is a predicted 1:5 chance of failure in one of the subsequent three flights. Even after a hundred successful flights and no failures, there is almost a 10% chance of failure in one of the next ten flights, and a single prior failure increases the odds of failure in the next ten flights to almost 20%.
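For anyone who wants to check those figures, here is a minimal sketch of one way to get them. It assumes the simplest possible model, which the numbers above happen to match: a uniform Beta(1, 1) prior on the per-flight success probability (Laplace's rule of succession), updated with the flight record, and then the posterior predictive chance of at least one failure in the next few flights. The function name `p_failure_in_next` and its parameters are made up for illustration; this is not necessarily the exact method used by any launch provider.

```python
from fractions import Fraction


def p_failure_in_next(successes: int, failures: int, next_flights: int) -> Fraction:
    """Chance of at least one failure in the next `next_flights` flights.

    Assumption (not stated in the post): uniform Beta(1, 1) prior on the
    per-flight success probability q, so the posterior after the observed
    record is Beta(successes + 1, failures + 1).
    """
    a = successes + 1  # posterior Beta parameter counting successes
    b = failures + 1   # posterior Beta parameter counting failures

    # E[q^m] for q ~ Beta(a, b) is the product of (a + i) / (a + b + i)
    # for i = 0 .. m-1, i.e. the chance that all m upcoming flights succeed.
    p_all_succeed = Fraction(1)
    for i in range(next_flights):
        p_all_succeed *= Fraction(a + i, a + b + i)

    return 1 - p_all_succeed


if __name__ == "__main__":
    cases = [
        ("new vehicle, first 5 flights", 0, 0, 5),
        ("10 successes, next 3 flights", 10, 0, 3),
        ("100 successes, next 10 flights", 100, 0, 10),
        ("100 successes + 1 failure, next 10 flights", 100, 1, 10),
    ]
    for label, s, f, m in cases:
        p = p_failure_in_next(s, f, m)
        print(f"{label}: {float(p):.1%}")
    # Prints 83.3%, 21.4%, 9.0%, and 17.1% respectively.
```

Under that assumed prior the results line up with "more than 80%", roughly 1:5, "almost 10%", and "almost 20%". A less pessimistic prior (say, one informed by heritage hardware) would give lower numbers early on, but the point stands: a clean record shrinks the predicted risk far more slowly than intuition suggests.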
What this means is you can’t just look at past successes and “roll the dice”; you have to be vigilant about addressing anomalous conditions while being realistic about accepting risk and having the expertise to winnow through the massive array of possible failures to identify those that are most crucial and probable on your system. It also means accepting the cost and consequences of risks, up to and including abandonment or redesign of a system that has inherent failure modes that just can’t be fixed in the existing design space. Given the inherent budgetary and schedule constraints imposed on the STS program, this last condition was a non-starter for identifying any critical failures which could not be fixed by process changes.
It’s easy to blame NASA and Thiokol management for the loss of Challenger, but the reality is that the STS was a prototype of a complex system that was pushed into production on a relatively shoestring budget without the ability to fix any of the inherent problems with the design. Despite the o-ring induced failure of the STS, the SRB was actually one of the most reliable and trouble-free components of the Shuttle system, and the consequence of failure can be attributed more to the innate criticality and complexity of anything in the propulsion chain than to the minor leakage at the field joint, which by itself didn’t compromise the performance of the SRB.
Stranger