When computers HAVE to work reliably

Ha ha ha…ten to twenty years old would be a dream! Government agencies frequently wait until hardware is no longer functional or software and operating systems are completely unsupported before even considering upgrading to newer technology. Until recently, the DoD was still using computers running Windows XP, and we still have some specialized tools that only run on Windows NT or deprecated versions of Unix from companies that have long since ceased to exist.

One of the big issues with hardware that has to be rated for the space environment or survive in an adverse environment (e.g. intense gamma radiation, high altitude electromagnetic pulse, et cetera) is that it has to be qualification tested so there is confidence in the theoretical reliability. That kind of simulation testing is really expensive to do even when it is successful, and if it forces a redesign or retest it can be a major hit. Most S-rated components take three or four years from design to market, and because there is very limited demand they may be the only choice in a certain performance range for a decade or more.

Software has to be robustly tested in a similar fashion (regression testing), where it is run through a massive set of permutations which hopefully expose any latent errors from ‘dead’ code, memory leaks, or other interface problems. It’s an expensive, laborious, and extremely unrewarding job to oversee this testing and refactoring, but when a single bug can turn your billion dollar communications satellite into a very expensive celestial bauble or cause your launch vehicle to fall out of the sky, you tend to put more emphasis on getting the details just so as opposed to, say, a car manufacturer which can make more nuanced (and lazy) evaluations of failure cost versus profit.
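To give a feel for what that looks like in miniature (this is just a toy sketch, nothing resembling real flight software practice; the routine and its names are invented), the brute-force idea is to grind a function through every permutation of its inputs and check the invariants, which is how latent corner cases get flushed out:

[code]
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical routine under test: saturating 8-bit add. */
static int8_t sat_add8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;          /* widen to avoid overflow in the intermediate sum */
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

int main(void)
{
    long cases = 0;
    /* Exhaustively run every input permutation and check the invariants. */
    for (int a = INT8_MIN; a <= INT8_MAX; a++) {
        for (int b = INT8_MIN; b <= INT8_MAX; b++) {
            int8_t r = sat_add8((int8_t)a, (int8_t)b);
            int ideal = a + b;
            /* Result must equal the ideal sum whenever it fits, else the clamp. */
            if (ideal >= INT8_MIN && ideal <= INT8_MAX)
                assert(r == ideal);
            else
                assert(r == (ideal > 0 ? INT8_MAX : INT8_MIN));
            cases++;
        }
    }
    printf("%ld cases passed\n", cases);
    return 0;
}
[/code]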

The command and control technology in silos is old because it is a major procurement and systems engineering problem to fund and implement upgrades that still interface with the overall system. Although there have been periodic upgrades of various systems on both the missiles and the wing EGSE, the general assumption is that the Air Force just needs to do enough to keep the system running for another decade or so until the ‘next generation’ system comes along. Ironically, the LGM-30G ‘Minuteman III’ was supposed to be replaced by the silo- and rail-based LGM-118A ‘Peacekeeper’, which started deployment in 1987 but was retired in 2005 (and never deployed in the intended Rail Garrison configuration), and by the road-mobile MGM-134A ‘Midgetman’, which was cancelled because of the end of the Cold War and significant development problems. So all of the newer systems are gone, and the MMIII technology that is older than the majority of the officers manning it is still in use.

Stranger

It’s also worth noting that, while modern consumer apps are huge, a lot of that programming is dedicated to making it pretty for the user. In reliability-critical applications, you don’t bother with icons of papers folding themselves into airplanes and fluttering from one picture of a folder to another. At most, you’d just have “Moving files… Done” display on the screen. And likely not even that, because you probably designed your system to never even need to move files.

In fact, much of the computational effort in most of the applications people use goes into visualization and other user interface features, and if an animation screws up once in a while or a button takes a couple of tries to activate, it’s a minor frustration. In most high criticality applications, and especially embedded systems, the user interface is purely functional and often little more than a few lights and buttons or a very simple display panel, and the user has very limited ability to manipulate principal parameters or interact directly with the operating system, which limits a lot of what can go wrong with computers and software. On the other hand, such systems are often not well designed for security because it is assumed that the user will not attempt to access the underlying system; this is particularly true of commercial embedded applications like appliances and car control systems, which can lead to untested vulnerabilities.

Stranger

Expanding this to a more general theory: features are added to lots of commercial software all the time because they look neat, or because they make a cool demo, or for all sorts of other fanciful reasons.

Features are never added to safety critical software for reasons like that. Features are added very carefully, with thousands of man-hours carefully considering the implications of the features, exactly how they will work, and how they will interact with other required features.

Then the code that implements them is carefully written, reviewed, tested, and certified. This process is painstaking, slow, and expensive.

But, unlike Windows: it works.

They did use the hand controller from an Apollo craft as well. Perhaps the coolest thing is that the DSKY they eventually used came from the just-returned Apollo 15 command module, making it something of a historical prize in its own right.

The fly by wire project wasn’t later than Apollo, it overlapped. The project was initiated in early 1970, and Armstrong was pivotal in pointing it toward using the Apollo guidance computers. By the start of the project proper in 1971, the later Apollo missions had been cancelled and there were spare computers available. The first flight of the fly by wire craft was May 1972, well before the last Apollo landing.

One of the limiting issues was that the computer had no writable storage for the programs - they were physically wired in during construction of the rope memory. The rope memory production facility was closing in mid-1972, so the fly by wire code had to be ready, tested, and good for production before then. As it was, the deadline for the code was the end of 1971. Eventually 60% of the code in the fly by wire system was lifted from the Apollo code base. They did however write program parameters into erasable memory - which had to be done when the aircraft was powered up just before flight.

The initial budget for the entire project was a scant $1 million. Eventually extensions racked up $12 million. Without the existing Apollo hardware and infrastructure it would clearly have not been viable.

This is all lifted from the rather good book - Computers Take Flight by James E. Tomayko.
PDF here.

We found that for processors, at least, while code coverage was useful for the tests, the best method was random testing - test segments with randomly selected instructions were run, of course taking care that nothing illegal was done. That found a lot of corner cases which the designers never dreamed of.
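For anyone who hasn’t seen it done, the gist (hugely simplified, and everything here is invented for illustration) is constrained-random stimulus checked against an independent reference model; mismatches point straight at the corner cases nobody wrote a directed test for:

[code]
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy "device under test": averages two 16-bit sensor words, but the
 * intermediate sum silently wraps at 16 bits.                        */
static uint16_t dut_avg(uint16_t a, uint16_t b)
{
    uint16_t sum = (uint16_t)(a + b);   /* latent corner case: wraps when a + b > 0xFFFF */
    return (uint16_t)(sum / 2u);
}

/* Independent reference model: widen first, so the sum can't wrap. */
static uint16_t ref_avg(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a + (uint32_t)b) / 2u);
}

int main(void)
{
    long mismatches = 0;
    srand(1);                           /* fixed seed so runs are repeatable */
    for (long i = 0; i < 1000000; i++) {
        /* Constrained-random stimulus: any pair of legal 16-bit operands. */
        uint16_t a = (uint16_t)(((unsigned)rand() << 8) ^ (unsigned)rand());
        uint16_t b = (uint16_t)(((unsigned)rand() << 8) ^ (unsigned)rand());
        if (dut_avg(a, b) != ref_avg(a, b))
            mismatches++;
    }
    printf("mismatches found: %ld out of 1000000 random cases\n", mismatches);
    return 0;
}
[/code]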
While hardware testing paid my bills for a long time, it is not enough. You need to accelerate early life failures using burn-in or other forms of stress testing, which brings the left side of the bathtub curve out of the customer’s premises and into your factory. Expensive, but worth it for highly reliable systems. Way too expensive for consumer products, another reason they are not going to be as reliable as the space shuttle or a high end server.
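A quick back-of-the-envelope on why burn-in pays off (purely illustrative numbers, with a Weibull shape parameter below 1 standing in for the infant-mortality part of the bathtub curve): the hazard rate falls steeply with operating time, so the hours you burn off in the factory are the worst ones the customer would otherwise see.

[code]
#include <math.h>
#include <stdio.h>

/* Weibull hazard rate h(t) = (beta/eta) * (t/eta)^(beta-1).
 * With shape beta < 1 the hazard falls with time: classic infant mortality. */
static double hazard(double t, double beta, double eta)
{
    return (beta / eta) * pow(t / eta, beta - 1.0);
}

int main(void)
{
    const double beta = 0.5;     /* illustrative shape: early-life failures dominate */
    const double eta  = 1.0e5;   /* illustrative characteristic life, hours */
    const double hours[] = { 1.0, 24.0, 168.0, 8760.0 };

    for (int i = 0; i < 4; i++)
        printf("t = %7.0f h   hazard = %.2e failures/hour\n",
               hours[i], hazard(hours[i], beta, eta));
    /* With these numbers, a week of burn-in (168 h) ships units with a hazard
     * rate an order of magnitude below the hour-one rate; the worst part of
     * the curve stays in the factory.                                       */
    return 0;
}
[/code]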

Some of you might be interested to learn that Los Alamos offers a service where you can bring some of your chips to be zapped by radiation. (I forget the details.) We’d bring test chips with new technologies and new structures. I never got to go to the zapping, but I analyzed some of the results and went to plenty of meetings.

There are a number of facilities, including LANL, which perform various types of radiation exposure testing. Unfortunately, many are closed, projected to close, or overloaded at a time when single event upsets are becoming even more of a hazard.

Stranger

The whole thread can be summed up in these words.

Painstaking. Slow. Expensive.

Here is an interesting (slightly older) article about writing nearly perfect software for the Space Shuttle.

https://www.fastcompany.com/28121/they-write-right-stuff

Interesting. It’s been a few years since I was involved in that work.
SEUs for memories are already with us, of course. I know of one product from a few years back where the failure rate was a function of the altitude. (Which we recorded.) I’ve been expecting it to be a problem for logic for some time, and there have been some solutions proposed around flip-flop design, but I haven’t seen much evidence that there has been a problem yet.
I’m talking of terrestrial applications, of course, not space ones.
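The software-side analogue that gets used when you can’t change the flip-flops (generic idea only, not any particular product I know of): keep redundant copies of the critical state, majority-vote on every read, and write the voted value back so a flipped bit gets scrubbed before a second upset can pile on top of it. A rough sketch:

[code]
#include <stdint.h>
#include <stdio.h>

/* Generic software TMR for an SEU-sensitive variable: three copies,
 * a bitwise majority vote on read, and the vote written back as a scrub. */
typedef struct { uint32_t copy[3]; } tmr_u32;

static void tmr_write(tmr_u32 *v, uint32_t value)
{
    v->copy[0] = v->copy[1] = v->copy[2] = value;
}

static uint32_t tmr_read(tmr_u32 *v)
{
    /* Bitwise 2-of-3 majority: a single flipped bit in one copy is outvoted. */
    uint32_t voted = (v->copy[0] & v->copy[1]) |
                     (v->copy[1] & v->copy[2]) |
                     (v->copy[0] & v->copy[2]);
    tmr_write(v, voted);              /* scrub so errors don't accumulate */
    return voted;
}

int main(void)
{
    tmr_u32 alt = {{0}};
    tmr_write(&alt, 35000u);
    alt.copy[1] ^= 0x40u;             /* simulate a single-event upset */
    printf("read back: %u\n", (unsigned)tmr_read(&alt));   /* still 35000 */
    return 0;
}
[/code]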

Story of a C-130 flight to Bar Yehuda Airfield on the Dead Sea. Apparently, flight computers don’t like to operate 1,240 feet below sea level.

:smack:

The main reason is simple commerce: customers consider anything that isn’t less than six months or so old to be obsolete in the computer world, so we literally make change for change’s sake, just to keep them “fresh.” I’ve occasionally suggested the release notes on software I’ve worked on should just read “Now, with new and improved version number!”

[mode=topper]
I know of a local retail store that, as of three years ago, still uses a DOS-based system for internal use. I’d be really surprised if they’ve upgraded since then.
[/mode]

For an example of what happens when government tries to get ultrahigh reliability combined with high functionality, look at all the trouble the F-35 has had. Even after something like 400 billion dollars, large chunks of its software aren’t ready. I’m sure some people here have more insight into it than I do, and it would be interesting to know their takes on which parts of the software they think are causing the most problems.

Just a general observation, but probably the biggest cause of massive time and cost overruns is requirements churn. Adding or changing (it is usually adding) requirements once the project is under way is almost guaranteed to blow things out in a manner much worse than the scale of the change. The trouble is that many contracts provide lucrative (for the contractor) incentives to accept these changes. Even the most straightforward software project can be derailed once the customer is afforded the ability to change their mind after the project is well under way.

My understanding of avionics is that it works basically like this:

Each plane carries three systems coded to the same specifications by different development teams (who do not share code or designs).

Every decision is cross-checked between two of the three computers by the third one, which sits in reserve.

If the two primary computers disagree on a calculation (which should never happen, as they were built and tested to the same specifications), the third computer “unhooks” them and takes over. (It doesn’t know which of the primary computers was wrong, but it knows at least one is.)
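In code terms, the arrangement as described amounts to something like this (a toy sketch with invented names and tolerances; take it as an illustration of the idea rather than how any real avionics box is built):

[code]
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the scheme described above: two primary channels are
 * cross-checked, and a hot spare takes over if they ever disagree.
 * All names and the tolerance are invented for illustration.        */

#define TOLERANCE 0.001   /* how closely the two primaries must agree */

typedef struct { bool spare_engaged; } ChannelState;

static double select_output(double primary_a, double primary_b,
                            double spare, ChannelState *st)
{
    if (fabs(primary_a - primary_b) > TOLERANCE) {
        /* The primaries disagree: we can't tell which one is wrong,
         * only that at least one is, so the spare takes over.        */
        st->spare_engaged = true;
    }
    return st->spare_engaged ? spare : primary_a;
}

int main(void)
{
    ChannelState st = { false };

    /* Nominal frame: both primaries agree, output follows channel A. */
    printf("%.3f\n", select_output(1.000, 1.000, 1.001, &st));

    /* Faulted frame: primaries diverge, the spare is latched in.     */
    printf("%.3f\n", select_output(1.000, 7.300, 1.002, &st));
    printf("spare engaged: %s\n", st.spare_engaged ? "yes" : "no");
    return 0;
}
[/code]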

Also worth noting is that sometimes the computers will agree with each other - that the inputs they are given make no sense. For example, this happened on Air France 447 when the pitot tubes got clogged and were sending nonsense airspeed readings. The computers didn’t disagree; they just didn’t know what to do about it.

I have never seen an aircraft or space launch avionics system that matches this description. Critical systems will have redundancy, and some are triple or quadruple redundant with ‘voting’ logic, but nobody is going to pay for three separate teams to independently develop flight code, especially given how costly it is to test and validate even one set of flight software.
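For what that ‘voting’ logic typically boils down to, here is a minimal 2-out-of-3 sketch (illustrative only; real implementations add timing, validity flags, and fault latching): three redundant channels feed a voter that takes the median, so a single faulted channel simply gets outvoted.

[code]
#include <stdio.h>

/* Minimal 2-out-of-3 voter: the median of three redundant channels.
 * Any single faulted channel is outvoted by the two healthy ones.   */
static double vote3(double a, double b, double c)
{
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

int main(void)
{
    /* Channel B has gone off in the weeds; the voter still tracks A and C. */
    printf("voted output: %.3f\n", vote3(1.000, 937.500, 1.002));
    return 0;
}
[/code]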

Stranger

My understanding is that Airbus have developed some parts of the flight control that way - not just different software teams but different processor hardware. Whether they continued with this regime I have no idea. IMHO Airbus has had an unfortunately large number of problems attributable to their flight control systems.

Wasn’t it the first generation of F-16s that didn’t like crossing the equator? They worked fine on either side, but when the plane flew across the equator it flipped upside down? Or was that an urban legend?