One presumes (hopes; fervently prays) that something as important as the President’s Nuclear Football is not running Windows 11 and Google Chrome. So how do absolutely vital systems get set up to be as reliable as humanly possible, and what keeps that level of reliability beyond the reach of common applications?
Probably the same way as most other technology employed by the government: by being at least ten or twenty years out of date. So old that all the bugs have been worked out, and have been for years.
In addition to age and extensive testing, it's also likely written in a low-level language with fewer layers of abstraction between the user inputs and the metal.
Common applications sacrifice reliability for cost, innovation, and flexibility (in the form of interoperability and expandability).
Note that you likely already have systems in your home which achieve that kind of reliability through simplicity: your coffee maker and other appliances likely have software running on them. I don't know if the Football is that simple, but from what I know of it, it doesn't seem to require much complexity.
The ‘football’ is a briefcase with some written material in it. The most complex technology is probably the briefcase combination locks which are both set to 123.
Just being old doesn't make it safe. What's needed is extensive review and testing at every step of development. The software architecture is developed through multiple reviews. Every code change is reviewed by other programmers. Every routine is tested and verified by other people. Every piece of software goes through an exhaustive validation test.
This is a complex topic, but no system is 100.00% reliable. Instead, critical systems are designed and engineered from the ground up to approach this value to a reasonable degree. One element is redundancy (and replication), so that your system is fault-tolerant; that goes for hardware as well as software. To contrive a simple example, instead of one CPU you could have three CPUs of different designs running three independently-developed programs all performing the same computation. If all answers match you have some confidence in the result; if only two match you can still go with the majority.
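To make that voting idea concrete, here's a minimal sketch in C of a 2-out-of-3 voter. The three "channel" functions are hypothetical stand-ins for the three independently developed programs running on three different CPUs; a real system would compare results in hardware or in a separate voting stage, but the logic is the same shape.

```c
#include <stdio.h>

/* Three hypothetical, independently developed implementations of the same
 * computation (stand-ins for code running on three different CPUs). */
static int channel_a(int x) { return x * x; }
static int channel_b(int x) { return x * x; }
static int channel_c(int x) { return x * x; }

/* 2-out-of-3 majority voter: if at least two channels agree, use that value;
 * a three-way disagreement is reported to the caller as a fault. */
static int vote(int a, int b, int c, int *result)
{
    if (a == b || a == c) { *result = a; return 0; }
    if (b == c)           { *result = b; return 0; }
    return -1;  /* no majority: treat as a system fault */
}

int main(void)
{
    int out;
    int x = 7;
    if (vote(channel_a(x), channel_b(x), channel_c(x), &out) == 0)
        printf("voted result: %d\n", out);
    else
        printf("fault: channels disagree\n");
    return 0;
}
```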
I worked in this area, and have worked closely with people at a National Lab on highly reliable chips for satellites (or so they told me), but really for nuclear weapons.
Besides the obvious TMR (triple modular redundancy) stuff DPRK mentioned, they did things like build Intel processors under license in a radiation-hardened technology, much less aggressive - but more reliable - than what is commercially available. You can add more parity bits to memories (the least reliable part of a chip) to allow for error detection in almost all circumstances and error correction in most. The memory in your PC has this already, but you can do a lot better if you can afford it.
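As a toy illustration of the parity idea in C: one extra bit per word lets you detect (though not locate or correct) any single-bit flip. Real ECC memory uses wider SECDED codes that can correct single-bit errors, but the principle is the same; the bit pattern here is invented.

```c
#include <stdint.h>
#include <stdio.h>

/* Even-parity bit for an 8-bit word: 1 if the number of set bits is odd. */
static uint8_t parity8(uint8_t w)
{
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}

int main(void)
{
    uint8_t word          = 0x5A;
    uint8_t stored_parity = parity8(word);   /* stored alongside the data */

    word ^= 0x08;                            /* simulate a single-bit upset */

    if (parity8(word) != stored_parity)
        printf("parity error detected on read\n");
    else
        printf("word reads back clean\n");
    return 0;
}
```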
You can check the results of arithmetic operations looking for errors. And highly reliable servers and telephone switches, both of which I've worked on, fail over to backup hardware at a grosser level than typical TMR. For instance, there were two computers in the AT&T #5ESS switch, with one shadowing the other, so if there was a hardware failure in one the other could take over. Worked very well unless someone didn't test software updates.
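One classic way to check arithmetic results is a residue check: compute the operation normally, compute the residues of the operands through a separate, simpler path, and compare. A toy sketch in C (not how any particular switch actually did it; a hardware checker would do this alongside the ALU):

```c
#include <stdint.h>
#include <stdio.h>

/* Residue check on a multiplier: the product's residue mod 3 must equal the
 * product of the operands' residues mod 3. A mismatch flags a faulty result. */
static int checked_multiply(uint32_t a, uint32_t b, uint64_t *product)
{
    uint64_t p = (uint64_t)a * b;                     /* main arithmetic path */
    uint32_t check = ((a % 3u) * (b % 3u)) % 3u;      /* independent check path */

    if ((uint32_t)(p % 3u) != check)
        return -1;                                    /* arithmetic error detected */
    *product = p;
    return 0;
}

int main(void)
{
    uint64_t p;
    if (checked_multiply(123456u, 7890u, &p) == 0)
        printf("product = %llu\n", (unsigned long long)p);
    else
        printf("arithmetic check failed\n");
    return 0;
}
```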
All of which seems to have nothing to do with the football. But hardware and software reliability is a gigantic field.
To add a bit more. It is simply hard.
You can look to examples of mission critical computer systems and discover extreme measures taken to ensure reliability. Manned spaceflight is perhaps one of the more challenging.
Getting software reliable enough to trust it to launch people into space isn't exactly easy. Worse, the systems you are controlling are hard real time. Real time means the code is executing in step with stuff happening in the real world at the same time. Hard means hard deadlines: if the code takes longer than allowed to complete its task, the system is in error and may well fail. Apollo 11's alarm codes 1201 and 1202 get right to the core of this in ways many don't appreciate. This was a hard real time executive - one of the very first - telling the astronauts that it had a problem, but that it was coping - and it landed on the surface whilst doing so.
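The shape of the problem, sketched in C: a cyclic executive has a fixed time budget per frame, critical work that must always run, and lower-priority work that gets shed when the frame is overloaded. All the numbers and function names here are invented for illustration; this is the flavour of the idea, not any actual flight executive.

```c
#include <stdio.h>

#define FRAME_BUDGET_US     20000L  /* 20 ms per cycle (invented number) */
#define BACKGROUND_COST_US   8000L  /* modeled cost of low-priority work */

/* Modeled cost of the must-run work; pretend frame 2 gets an unexpected load. */
static long run_critical_tasks(int frame)
{
    return (frame == 2) ? 18000L : 12000L;
}

int main(void)
{
    for (int frame = 0; frame < 4; frame++) {
        long used = run_critical_tasks(frame);

        if (FRAME_BUDGET_US - used >= BACKGROUND_COST_US)
            used += BACKGROUND_COST_US;   /* room left: run background work */
        else
            printf("frame %d: overloaded, shedding low-priority work\n", frame);

        if (used > FRAME_BUDGET_US)
            printf("frame %d: DEADLINE MISSED (%ld us)\n", frame, used);
        else
            printf("frame %d: done in %ld us\n", frame, used);
    }
    return 0;
}
```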
Some of the biggest advances in software engineering came out of working out how to manage these systems. And some of the unsung greats.
Making the hardware reliable may involve heroic effort. As above, you don't use the latest and greatest. You use technology a few generations old, you use radiation-tolerant design techniques, and you duplicate things. Favourite example - the second-gen Space Shuttle main engine controllers were four identical processors (off the same wafer, with as near identical electrical parameters as possible), run in lock step - one pair a clock step behind the other. Additional logic tested the results to ensure everything was identical, and could swap to the other pair if needed.
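Roughly the shape of that pair-and-spare comparison, sketched in C. The real controllers did this in hardware, clock by clock; the "processor" functions and the injected fault here are invented.

```c
#include <stdio.h>

/* Pair-and-spare sketch: the two lock-stepped processors in a pair must
 * produce identical outputs; on a miscompare, control switches to the spare
 * pair. The processor functions are invented stand-ins. */
static int pair_a_cpu1(int x) { return x + 1; }
static int pair_a_cpu2(int x) { return x + 2; }  /* pretend this one has failed */
static int pair_b_cpu1(int x) { return x + 1; }
static int pair_b_cpu2(int x) { return x + 1; }

int main(void)
{
    int x = 41;
    int a1 = pair_a_cpu1(x), a2 = pair_a_cpu2(x);

    if (a1 == a2) {
        printf("pair A output: %d\n", a1);
    } else {
        printf("pair A miscompare, switching to pair B\n");
        int b1 = pair_b_cpu1(x), b2 = pair_b_cpu2(x);
        if (b1 == b2)
            printf("pair B output: %d\n", b1);
        else
            printf("both pairs miscompare: fall back to a safe state\n");
    }
    return 0;
}
```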
Even in commercial systems, dual lock-step processors are common. Back in the day you had companies like Tandem who made the NonStop machines. (They still exist somewhere inside the behemoth that is HP.)
But in the end, there is no substitute for rigorous design techniques, simple-as-possible code, code reviews, and testing. All sorts of testing. And when you are done, more testing. Not sure? More testing. You test the tests, and you test the testers. Some testers are allowed to do anything they like to try to kill the code - like look at the source code and actively try to find ways of breaking it. Other tests just try to drive the code into every possible corner it might possibly go and break it that way. And you have automated test systems that check whether every line of code has seen activity during tests (coverage).
And you build integrated test harnesses to test the entire hardware software system. And you drive this as hard as you can to try to break the system, and you test that when you do break something that it still behaves safely.
Needless to say, this isn’t cheap.
Phone systems, as noted above, were another good example. Systems were required to have less than a couple of minutes of aggregate downtime per year. Specialist languages - such as Erlang - were developed to aid in writing code that was both highly reliable and could actually sustain in-flight updates. And yet more redundant hardware. (And indeed, it all worked great until it didn't.)
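The flavour of the Erlang approach is "let it crash, and restart it fast, in isolation". Here's a very rough sketch of that supervisor idea in POSIX C, since we're using C for the other examples: the worker runs in its own process and is restarted when it dies. Erlang/OTP supervisors do this inside the VM and add restart limits, back-off, and ordered restarts of dependent workers; the worker body here is invented and crashes deliberately just to show the restart path.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Hypothetical worker: handle some traffic, then crash on purpose so the
 * supervisor's restart path is exercised. */
static void worker(void)
{
    printf("worker %d: handling traffic\n", (int)getpid());
    sleep(1);
    abort();
}

int main(void)
{
    for (int restarts = 0; restarts < 3; restarts++) {
        pid_t pid = fork();
        if (pid == 0) { worker(); _exit(0); }

        int status = 0;
        waitpid(pid, &status, 0);            /* wait for the worker to die */
        printf("supervisor: worker died, restarting (%d)\n", restarts + 1);
    }
    fprintf(stderr, "supervisor: too many restarts, escalating\n");
    return 1;
}
```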
Not on the level of nuclear footballs, but I used to work in the power industry using computers to automate substation and power plant processes, and redundancy is the key. We’d get computers with no moving parts, designed specifically for the application. A big heat sink instead of fans, solid state drives instead of spinning rust, probably running a stripped down, very stable version of Linux under whatever software we were using. I can’t speak to how programmers designed the software though. These devices had a mean time between failure of about 400 years (this means that if you own 400 of them, you can expect one to fail in a given year). And then we’d use two of them running in parallel, probably both connected to two separate redundant networks if communications reliability was important. If one device (or comm link) failed, which was rare enough, the other would still work while we replaced or fixed the failed device.
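Back-of-the-envelope math on why the redundant pair is worth it, in C so the numbers are explicit. The 400-year MTBF is from the post above; the one-day mean time to repair and the assumption of independent failures are mine, purely for illustration.

```c
#include <stdio.h>

int main(void)
{
    double mtbf_hours = 400.0 * 365.0 * 24.0;  /* ~3.5 million hours per device */
    double mttr_hours = 24.0;                  /* assumed repair time: 1 day */

    double availability   = mtbf_hours / (mtbf_hours + mttr_hours);
    double unavail_single = 1.0 - availability;

    /* Two devices in parallel, assuming independent failures: the system is
     * down only when both happen to be down at once. */
    double unavail_pair = unavail_single * unavail_single;

    printf("single device unavailability: %.2e\n", unavail_single);
    printf("parallel pair unavailability: %.2e\n", unavail_pair);
    printf("expected failures per year across 400 devices: %.1f\n",
           400.0 * (365.0 * 24.0) / mtbf_hours);
    return 0;
}
```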
A work colleague of mine who used to work in the aviation field told me something similar on how aircraft software is written. Obviously you don’t want to have to reboot your avionics OS at 200 knots 100 feet above the runway on final approach.
I would imagine software might be “old” not because old is reliable, but because it takes a lot longer to test mission critical software. Also because of the extensive testing, people are less likely to replace it as often.
I would also imagine that a purpose-built computer working in a controlled environment like an airplane or nuclear plant can be made more reliable than a generic all-purpose PC that any customer can install any software on.
For the industries (not nuclear devices) there are three sector-specific standards under the IEC 61508 framework: IEC 61511 (process), IEC 61513 (nuclear), and IEC 62061 (manufacturing/machinery).
There are multiple levels of safety, and the safety system is a very simple one whose job is to take the process to a fail-safe state whenever the control system fails.
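The core of that is a simple trip pattern: the safety layer does almost nothing except watch for the control system to stop behaving (or a limit to be exceeded) and then drive the outputs to their safe state. A toy sketch in C; the I/O functions and the pressure limit are invented stand-ins, and a real safety instrumented system would of course be engineered and certified to the standards above.

```c
#include <stdbool.h>
#include <stdio.h>

/* Invented stand-ins for real I/O. */
static bool control_heartbeat_ok(void) { return false; } /* simulate a control failure */
static double pressure_kpa(void)       { return 850.0; }
static void   trip_valves(void)        { printf("TRIP: outputs driven to safe state\n"); }

#define PRESSURE_LIMIT_KPA 1000.0

int main(void)
{
    /* Deliberately simple logic: any doubt, go to the fail-safe state. */
    if (!control_heartbeat_ok() || pressure_kpa() > PRESSURE_LIMIT_KPA) {
        trip_valves();
        return 1;
    }
    printf("process healthy, no action\n");
    return 0;
}
```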
Exactly right. Some months ago, I was reading an article on the subject, and it mentioned that the computer technology in our missile silos is so old and simple that it is safe for that reason.
And astoundingly, this is exactly what was happening on Apollo 11 as they landed. The computer was designed to be able to reboot, work out what it was doing, recover its state, and continue to control the lander. In fairness, the alarms were not going off at 100 feet, but it wasn't good. There was a second, smaller computer that would control a landing abort if the main computer properly failed. At 100 feet to go, however, things were grim, and an abort wasn't going to work no matter what was controlling it.
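The general pattern behind "reboot and pick up where you left off" is checkpoint and restart: keep the small amount of state you need to resume somewhere that survives the restart, and restore it on the way back up. A toy sketch in C with an invented file name and state layout; the AGC did this rather differently, keeping key state in protected erasable memory rather than writing files, so treat this as the concept only.

```c
#include <stdio.h>

struct state { int phase; double altitude_m; };

static int load_checkpoint(struct state *s)
{
    FILE *f = fopen("checkpoint.bin", "rb");
    if (!f) return 0;
    int ok = fread(s, sizeof *s, 1, f) == 1;
    fclose(f);
    return ok;
}

static void save_checkpoint(const struct state *s)
{
    FILE *f = fopen("checkpoint.bin", "wb");
    if (!f) return;
    fwrite(s, sizeof *s, 1, f);
    fclose(f);
}

int main(void)
{
    struct state s = { 0, 10000.0 };

    if (load_checkpoint(&s))
        printf("restart: resuming at phase %d, %.0f m\n", s.phase, s.altitude_m);
    else
        printf("cold start\n");

    /* ...do one step of work, then checkpoint before the next step... */
    s.phase += 1;
    s.altitude_m -= 500.0;
    save_checkpoint(&s);
    printf("checkpoint written at phase %d\n", s.phase);
    return 0;
}
```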
Not sure if you are joking, but you are spot on, at least in the Kennedy era:
From pages 359-360 of Command and Control: Nuclear Weapons, the Damascus Accident, and the Illusion of Safety by Eric Schlosser
(emphasis mine)
(as an aside, I went looking for one of those handy-dandy MLA or APA citation formatter tools and found that they have all gone the way of the dodo, replaced by payware services all with different web sites but suspiciously similar UIs and asking me to upgrade to premium. oh well, TANSTAAFL.)
Life-critical systems are developed to a higher cost point and a much lower function point.
This is commonly misunderstood. I have read statements by Senior VPs at well-known software companies bemoaning they cannot develop software as reliable as the Space Shuttle computers or an airline flight control computer. They advocate “learning from mission critical developers how they do that”. This shows a profound lack of understanding.
Reliability is the product of hardware and software.
The Apollo Guidance Computer was custom-built from the component level to be the most reliable non-redundant computer possible. If a single transistor or IC on a fabrication tray failed testing, the entire tray would be discarded – even if all the others passed.
I don’t remember the exact headcount of the shuttle PASS software development and test team but it was hundreds of people working for years. In very rough terms about the size and development duration of a major new version of Windows or macOS.
So if you have the same # of people working the same amount of time to develop and test 1/100th or 1/200th the amount of code, it can be more reliable.
From a hardware standpoint, the Apollo AGC was so expensive it was already recognized by the early 1970s when the shuttle flight control system was developed, that it would be economically impossible to use that model. Thus the shuttle used quad-redundant IBM AP-101 computers operating essentially in lockstep. If one failed it was voted out. To avoid the possibility of a generic compiler bug crashing them all, a 5th backup computer ran clean sheet software developed by a separate team using a separate compiler. So there were two redundant development and test teams, yet this was still less expensive than doing it the Apollo way.
The Apollo AGC was so expensive and so reliable, that years later when NASA developed the world’s first digital fly-by-wire aircraft, they simply used a surplus Apollo computer. Developing a new flight control computer from scratch would have been too expensive for a proof of concept test. Here it is keyboard and all, mounted inside the F-8 fuselage. The keyboard obviously wasn’t used in flight: they just needed the computer to implement the fly-by-wire system: https://www.nasa.gov/sites/default/files/images/362812main_E-24741_full.jpg
More recently, I believe the Airbus A380 flight control system is five-way redundant, one three-computer primary group and a two-computer backup group. It can fly using only one computer in a group. I suspect the software on the primary and backup group is probably developed separately.
From the standpoint of either hardware or software development, it’s not commercially feasible to deliver such low-function, high-priced products for general use.
MS-DOS 3.0 had about 40,000 lines of code, within a factor of 2 of the Apollo AGC. A very small development team wrote that and it ran on off-the-shelf microprocessors. If hundreds of people worked for years writing and testing it, if their development mission was absolute reliability, and if it ran on custom-built hardened redundant computers, it would be very reliable – and no customer could afford it, nor today would they want it.
Today despite improved software development tools and hardware, life-critical real-time systems tend to be very function limited, plus they are often redundant. The methods, software tools and personnel structure to produce those are not generally applicable to mass market commercial software which has 100x the lines of code.