Challenger disaster question about O-rings

No. That’s simply not correct. You design according to certain specifications intended to achieve a certain level of reliability over a certain period of time. You adopt quality assurance protocols to help you get to that point and stay there.

To do otherwise sets you up for lawsuits, yes, but it’s not just talk. You do risk assessments to minimize liability, and if there is a failure you take countermeasures to prevent a recurrence of the same failure mode. It’s not fooling yourself, it’s playing it smart.

If you think you’ve taken something inherently dangerous, like space flight, and made it zero risk, then you are fooling yourself. Which is not to say that failures from any cause are acceptable, only that even a vanishingly small probability of failure is still a non-zero probability of failure.

The space program is full of smart people and they failed in the Challenger incident by ignoring the warning signs and the safeguards they set up. People were the cause of the failure, not the system itself.

People are part of the system. A system that does not account for how people might fail is a bad system. But that’s beside the point, which is whether anyone with engineering credentials involved in the shuttles’ design ever for a minute allowed themselves to believe the probability of failure was, or could be, “zero” even if all the people involved performed 100% as expected.

ETA: None of this is to justify a cavalier attitude of “well, shit happens, spilt milk, broken eggs, etc.”—which is a pretty good encapsulation of the sort of attitude that can lead to events like the loss of Challenger—only to note that there is a difference between taking reasonable and necessary precautions in keeping with sound engineering practices versus “plan[ning] for zero incidents.”

I should know when to shut down at the end of my work day and stop answering questions from the hip. I deserve to be beaten by the six sigma people for making a rookie error.

They can beat you with their black belts…

Claimed reliability numbers were all over the place. According to Feynman, NASA management thought the failure rate was as low as 1/100,000:

Engineers at Rocketdyne, the manufacturer, estimate the total probability as 1/10,000. Engineers at Marshal estimate it as 1/300, while NASA management, to whom these engineers report, claims it is 1/100,000. An independent engineer consulting for NASA thought 1 or 2 per 100 a reasonable estimate.

It seems that only the independent engineer really had a clear grasp of the uncertainties.
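
Just to put those claimed rates side by side, here’s what each per-flight number would have implied over the 135 flights the program eventually flew (the 135 is historical; everything else is back-of-the-envelope arithmetic, sketched in Python):

```python
# Back-of-the-envelope only: probability of at least one loss over the
# program, for each claimed per-flight loss rate.
rates = {
    "NASA management (1/100,000)":    1 / 100_000,
    "Rocketdyne (1/10,000)":          1 / 10_000,
    "Marshall engineers (1/300)":     1 / 300,
    "independent consultant (1/100)": 1 / 100,
}

flights = 135
for who, p in rates.items():
    p_loss = 1 - (1 - p) ** flights
    print(f"{who}: P(>=1 loss in {flights} flights) ~ {p_loss:.1%}")

# The actual record: 2 losses in 135 flights.
```

The management figure implies a program you could run for a very long time and almost certainly never lose an orbiter; the independent engineer’s figure implies losing one or two was more likely than not.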

It’s possible that in the early design days (70s), they were a bit more realistic, but by the time of Challenger something had distorted their thinking.

I have no six sigma certification, nor do I ever desire to. I am also not a rocket scientist.

Not really. The real pisser about Columbia was that the same kinds of mistakes that led to Challenger were behind Columbia too. Despite the supposed “lessons learned” from the first accident.

In each case the engineers were seeing documented risky events occur that had not been planned for. This was uncontrolled, unmitigated, unanalysed incremental risk over the supposedly known baseline. The engineers duly raised concerns to management who in effect said “Since we haven’t crashed from these unplanned events, that proves the extra risk must actually be exactly zero. Ignore them.” And that thinking carried the day. The rest is just details.

Ultimately, IMO, the shuttle was grounded not because the machine itself was insufficiently safe, but because the total shuttle management culture was insufficiently safe to be allowed to continue operating anything, and it wasn’t practical or affordable to replace enough of the people quickly enough to achieve culture change while still keeping enough actual expertise on board.


Thanks to the several folks above for the corrections re: my comments on planned vs. actual shuttle reliability. That’s what I get for shooting from the hip late in the day vs. digging up cites. Y’all deserve better in GQ.

The punchline remains that from the get-go they expected to lose one or more before it was over, but started from the goal of minimizing that number to the degree the tech of the day allowed. Then lots of stuff happened to lots of people, organizations, budgets, and hardware over the roughly 40-year (!) period from first draft designs to final flight.

You have an aviation background: how much of this is a bad thing? If a craft proves more robust than what the specifications called for, isn’t it better to take advantage of that if possible (I realise the answer is probably no for the Shuttle, but maybe not for earthbound aviation), and expand the envelope, so to speak?

Confidence in safety ought to arise when observed performance is consistent with, or exceeds, expectation. So seeing burnthrough when none was expected is bad, even if the burnthrough doesn’t cause an immediate disaster. When observed performance exceeds expectations (no burnthrough even in conditions where the design didn’t guarantee its absence), that’s when you can think about expanding the envelope.
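
A minimal sketch of that decision rule, with numbers that are entirely made up (the 1% allocation, 24 flights, and 5 anomalies are hypothetical): compare what you’ve actually observed against what the design basis allocated before you even discuss expanding anything.

```python
# Minimal sketch, hypothetical numbers throughout; requires scipy >= 1.7.
from scipy.stats import binomtest

allocated_rate = 0.01   # design basis: anomaly budgeted on ~1% of flights (made up)
observed = 5            # anomalies actually seen (made up)
flights = 24            # flights flown so far (made up)

# One-sided test: is the observed anomaly rate credibly above the allocation?
test = binomtest(observed, flights, allocated_rate, alternative="greater")
if test.pvalue < 0.05:
    print("Observed anomalies exceed the design allocation: stop and investigate.")
else:
    print("Consistent with the design basis; expansion can at least be discussed.")
```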

Statistical metrics for spaceflight are near meaningless. There simply have not been enough flights. About the only conclusion one could draw from the shuttle programme was that NASA was not capable of running it. Not that it is clear anyone else could have done so either. But the culture, and two disasters that both stemmed not from engineering but from management failures, were damning. Then you need to add in the near loss of STS-27. They had a culture of normalising failure that never got any better after Challenger. For both Challenger and Columbia they deliberately overrode flight rules that would have grounded the fleet until the problem was fixed. In both cases the overridden rules concerned engineering design violations that did indeed end up causing the loss of the craft.
The final reports on both Challenger and Columbia are highly recommended, and sobering, reading.
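
To put a number on “there simply have not been enough flights”: even taking the final record at face value, the uncertainty band on the per-flight loss rate is enormous. A rough sketch (the 2-in-135 record is historical; the Jeffreys interval is just one reasonable way to bound it):

```python
# Rough sketch: a Jeffreys (Beta) interval on the per-flight loss rate
# from 2 losses in 135 flights. Requires scipy.
from scipy.stats import beta

losses, flights = 2, 135
lo = beta.ppf(0.025, losses + 0.5, flights - losses + 0.5)
hi = beta.ppf(0.975, losses + 0.5, flights - losses + 0.5)
print(f"naive point estimate: ~1 in {flights / losses:.0f} flights")
print(f"95% interval: roughly 1 in {1 / hi:.0f} to 1 in {1 / lo:.0f} flights")
```

That spread is wide enough that the flight record alone can’t distinguish a moderately risky vehicle from a quite dangerous one, which is rather the point.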

Imagine an engineering team designs and builds a toaster. It toasts bread just fine, but the engineers discover that every time you toast a slice of bread, it fires a potentially lethal bullet in a random direction. This was not part of the original design; it’s just something that happens. Management reviews the track record of random-bullet-firings and says “since nobody has been killed by these unplanned events, that proves the extra risk must actually be exactly zero. Ignore them.”

Clearly that’s not a prudent course of action. The toaster design is not especially robust; it’s just that users have, until now, been lucky not to get hit by any of those random bullets.

That’s pretty much what NASA did with the shuttles. In the case of the Challenger, the booster O-rings were definitely not designed to partially burn away; Feynman’s minority report exposed the bizarre rationale by which NASA decided that an average burn-through of 1/3 meant that the O-rings had a safety factor of 3. In the case of the Columbia, foam was not designed to be ripping off of the bipod ramps on the tank during ascent, but NASA kept rolling the dice anyway.
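
For concreteness, the arithmetic behind that claim, in schematic form (normalized depth, worst erosion of about a third, per Feynman’s appendix): the ratio only looks like a safety factor if you forget that the design basis was zero erosion.

```python
# Schematic numbers only: why "ring depth / depth eroded" is not a safety factor.
ring_depth = 1.0                 # normalized seal depth
worst_erosion = ring_depth / 3   # worst observed erosion, about a third

claimed_factor = ring_depth / worst_erosion   # the "factor of 3" claim
allowed_erosion = 0.0                         # the actual design basis

print(f"claimed 'safety factor': {claimed_factor:.0f}")
print(f"erosion allowed by the design basis: {allowed_erosion}")
# Any erosion at all means the sealing model is already wrong.
```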

It’s not that the shuttle proved to be more robust than the specs said; it’s that NASA had for years been enjoying good luck in spite of bad judgment, and elected to continue leaning on that good luck until people got killed.

Yes. @Stranger_On_A_Train has given several scathing treatises on this statistical reality in prior rocketry threads though my search efforts came to naught.

Recalling I’m at the ops end, not the design end. I’d say “Yes, but …”.

The challenge as I see it is the entire vehicle system is very complex. There are always undocumented (or even unrecognized) interdependencies in addition to the 10s of thousands of documented interdependencies. However holistic and fully understood the original design process is, it’s highly likely that any significant later modification will be pursued less holistically and be less well understood.

The certification of the 737 MAX was a case in point. Things were proposed to be changed, and the first-order effects of the proposed changes were thoroughly analyzed. But they didn’t dig deep enough to discover that the changes were invalidating safety-critical design assumptions buried in third-order effects. And then, to boot, they further tweaked the proposed changes in a later round of refinement without revisiting even the first-order effects thoroughly enough to realize they’d invalidated those design assumptions too. Significant harm ensued, both to blood and to treasure.

Conceptually similar issues of lesser impact have surfaced in the KC-46 and the 777X efforts. And most probably they have done so in other companies’ products as well; this isn’t a Boeing-only phenomenon. Boeing’s issues have just been the ones recently well exposed by investigations with public reports of findings.

The underlying point is that the engineering effort to be sure you’re not opening a can of worms is probably bigger than the payoff to be had from post-hoc apparently minor tweaks like exploiting stronger-than-expected behavior. Whether in ops procedures, or in factory fabrication, or in maintenance.

That’s not to say that learning and in-service experience aren’t invaluable or don’t inform procedures and limits going forward. They certainly do. But it’s pretty incremental, around the edges. A common example: the design assumes some part will wear enough to need replacement at, say, 1,000-hour intervals. After replacing a lot of them and analyzing the removed parts, the engineering folks discover the removed components are still nearly unworn. So they can soundly step the replacement interval out a little. But not a lot at any one bite. Lather, rinse, repeat.
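
Schematically, something like the sketch below, where the 1,000-hour interval, the wear numbers, the 10% cap, and the 50% derating are all invented for illustration rather than taken from any real maintenance standard:

```python
# Invented numbers throughout: step a replacement interval out only a little
# at a time, based on how worn the removed parts actually were.
def propose_new_interval(current_hours, wear_fractions,
                         max_step=0.10, derate=0.5):
    """wear_fractions: wear observed at removal, as a fraction of the allowable limit."""
    worst = max(wear_fractions)
    # Linear projection of when the worst part would reach the limit,
    # derated because wear is rarely perfectly linear.
    projected = current_hours / worst * derate
    # Never extend more than max_step (10%) in one revision.
    return min(projected, current_hours * (1 + max_step))

# Parts pulled at 1,000 h show at most 30% of allowable wear:
print(propose_new_interval(1000, [0.22, 0.25, 0.30]))   # -> 1100.0, one small bite
```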

But everyone needs to be mindful that the entire rest of the system has been benefitting from those unworn parts. IOW we don’t actually have in-service experience with the rest of the system running with parts as worn as the design assumed; we’ve only got in-service experience with anomalously under-worn parts. But we’re about to begin gathering experience with worn-as-designed parts if the change-interval extension is approved.

The classic example is discovering that your filters aren’t filling with debris as rapidly as you thought, so you decide to run them longer. But now more debris is circulating in the system and you begin to see pump or valve failures you hadn’t seen before. Turns out the filters were over-designed and the pumps/valves were under-designed for the actual debris load. The filters had been compensating for the pumps/valves, so the shortfall was masked in practice. Until it wasn’t.

In a sound bite: If you find one link in a chain is stronger than expected, that says nothing about the whole chain.

It’s possible in principle (though perhaps not in practice) to chase all the known unknown rabbit holes to the bitter end. It’s the unknown unknowns that still lurk.

Maybe smarter to leave well enough alone and just enjoy the enhanced reliability over expectations.

Another way to look at it is that the Shuttle was unsafe because it didn’t have enough failures. There are approximately a bazillion (1 bazillion = 10^umpteen) different things that can fail on a machine that complicated. Each one has an extremely low probability of failing, but when you put all of them together, the aggregate chance of failure ends up uncomfortably high. And so the only way to make the system significantly safer is to make all of those bazillion things safer.
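
The arithmetic is brutal even with purely illustrative numbers (the 10,000 failure modes and the 1-in-100,000 per-item rate below are made up):

```python
# Purely illustrative: many individually reliable items still add up to an
# uncomfortable aggregate risk per flight.
n_items = 10_000   # hypothetical count of credible failure modes
p_each = 1e-5      # hypothetical per-flight failure probability of each one

p_any = 1 - (1 - p_each) ** n_items
print(f"P(at least one failure per flight) ~ {p_any:.1%}")   # about 9.5%
```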

So, the O-rings failed. We saw how they failed, and when we saw that, we took steps to improve them. Even in a safety-blind culture like NASA turned out to be, we can be pretty sure that there won’t be another O-ring failure that leads to catastrophe.

And so the next failure wasn’t O-rings; it was falling foam. So NASA fixed that problem, so that falling foam wouldn’t cause any more catastrophes.

Keep iterating this long enough, and eventually you’ve got a safe vehicle. But it takes a heck of a lot of catastrophes before you find (and fix) all the ways it can fail.

If planes hit birds on takeoff, and it is found that they are not taking as much damage as it was thought they would, then maybe you change your emergency procedures.

You still get rid of the goose pond between the runways though.