Firmware is the software inside my printer, yes?

Sort of. The danger is that the system can be in an inconsistent state as viewed purely from the contents of the non-volatile storage. Partway through the update it may, in principle, not be possible to boot from anywhere. In reality it is usual to have two copies of the firmware, and to overwrite one whilst still running from the other. But even here you need to ask: how does the system know which one to try to boot from next time? So there is always a tiny window in which the system isn’t consistent. But sensible coding should reduce this window to a reasonably short interval.
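
To make the “which one next time?” question concrete, here is roughly the sort of selection logic I have in mind, sketched in C. The names (boot_ctrl_t, flash_read_ctrl and so on) are invented for the sketch, not taken from any real product; the point is that the only state that changes at switch-over time is one tiny, CRC-protected record.

```c
/* Hypothetical two-image boot selector.  The only state that ever changes
 * at boot-select time is a small "boot control" record, kept in two copies,
 * each validated by its own CRC. */
#include <stdint.h>

typedef struct {
    uint32_t sequence;   /* monotonically increasing update counter */
    uint32_t image;      /* 0 or 1: which firmware slot to boot     */
    uint32_t crc;        /* CRC over sequence and image             */
} boot_ctrl_t;

extern int      flash_read_ctrl(int copy, boot_ctrl_t *out);  /* assumed HAL */
extern uint32_t crc32(const void *p, unsigned len);           /* assumed HAL */

/* Pick the firmware slot to boot: take the valid control record with the
 * highest sequence number; fall back to slot 0 if neither copy is valid. */
int select_boot_image(void)
{
    boot_ctrl_t a, b;
    int a_ok = flash_read_ctrl(0, &a) == 0 &&
               a.crc == crc32(&a, sizeof a - sizeof a.crc);
    int b_ok = flash_read_ctrl(1, &b) == 0 &&
               b.crc == crc32(&b, sizeof b - sizeof b.crc);

    if (a_ok && b_ok)
        return (a.sequence > b.sequence) ? (int)a.image : (int)b.image;
    if (a_ok) return (int)a.image;
    if (b_ok) return (int)b.image;
    return 0;   /* last resort: try slot 0 and hope */
}
```

Because there are two copies of that record and each carries its own CRC, a power cut mid-write just leaves one stale but valid copy behind. That narrows the inconsistent window a great deal, but, as above, never quite closes it.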

Much more complex systems, like those running full operating systems, may not have enough space to manage a live and a shadow copy, and they will perform conventional OS style updates. That is a bit more scary, as again, it is possible to find yourself in a situation where if you only look at what is on non-volatile media, the OS is in an inconsistent state. Quite a bit of effort goes into reducing the risks here as well, but you can never actually reduce the risk to zero.

(The problem is directly mappable to the Two Generals problem.)

This isn’t really true. One common way to ensure reliable firmware updates is to embed a strong checksum of the entire firmware image in the image. Then at boot time the boot code verifies the checksum and ignores the image if it’s not correct. The only risk is that some random data will accidentally match the checksum. But the risk of that is so low that it’s far more likely that cosmic rays (or a meteor strike) will render your system unbootable.
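
For anyone who wants to see the shape of it, the boot-time check amounts to something like the sketch below. The header layout, flash_image_base() and strong_checksum() are made up for the sketch; real products vary.

```c
/* Hypothetical image header: length and checksum stored alongside the image. */
#include <stdint.h>

#define IMAGE_MAGIC     0x46574D47u      /* arbitrary marker for this sketch */
#define MAX_IMAGE_BYTES (1024u * 1024u)  /* invented upper bound             */

typedef struct {
    uint32_t magic;     /* identifies a firmware image                    */
    uint32_t length;    /* number of payload bytes that follow the header */
    uint32_t checksum;  /* strong checksum/hash of the payload            */
} image_header_t;

extern const uint8_t *flash_image_base(void);                        /* assumed */
extern uint32_t       strong_checksum(const uint8_t *p, uint32_t n); /* assumed */

/* Returns nonzero if the image in flash is complete and intact. */
int image_is_valid(void)
{
    const image_header_t *hdr = (const image_header_t *)flash_image_base();
    const uint8_t *payload = flash_image_base() + sizeof *hdr;

    if (hdr->magic != IMAGE_MAGIC || hdr->length > MAX_IMAGE_BYTES)
        return 0;                    /* never written, erased, or nonsense */
    return strong_checksum(payload, hdr->length) == hdr->checksum;
}
```

Typically the update code would write the header (or at least the checksum field) last, so a partially written image can never look complete.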

That only eliminates one risk – that of a potentially corrupt image. It does nothing about interruptions due to power failures or glitches or undetected communications errors or EEPROM write failures or any number of other problems.

I have no idea how robust a typical firmware update process is, but I do know that they can and do fail and can in some cases result in the classic “bricked” device. For that reason, I absolutely hate doing firmware updates and I deplore the casual way they are often suggested and sometimes even forced on you. I only do them when they are essential, to fix a known problem. The risk is that a firmware update that is interrupted – by a power glitch or anything else – may (depending on design) leave the device brain-dead and unable to recover, or may cause other issues that are not necessarily recoverable. To add to the fun, some devices match their eagerness to force new updates with an adamant refusal to roll back to older ones – whatever engineer came up with that idea should be taken out back and shot, although it sounds more like the sort of madness that would come from a customer support bureaucrat.

OTOH, sometimes firmware updates are unavoidable. The most harrowing one I’ve ever done was an essential BIOS update on my laptop. It worked out fine, but I think I had more gray hairs when it was over than before I started. It could have turned the laptop into a useless doorstop.

It does handle all those cases, because in any of those events the complete image is not written, or is written incorrectly, and the checksum will not match when the boot code checks it. Power failures, glitches, and so on all end up producing a corrupt or incomplete image, which the check catches. If the image is not corrupt then by definition it’s bootable. Perhaps I’m not understanding the scenarios you’re describing. But I’ve worked on firmware update code for consumer devices for over 20 years and I’ve thought pretty deeply about all the potential risks (as have many other good engineers I’ve worked with), and I’ve seen the results of millions of devices getting regularly updated. Bricking a device due to an update is extremely rare and, in my experience, always due to some type of permanent hardware failure.

BTW, BIOS updates on a PC are a different beast, because they rarely (I believe) use the two-image ping pong logic that we’re discussing. There’s just one copy of the BIOS, and if it is partially written, the device is bricked.

I don’t know how common the two-image logic is, but I’m more referring to the single-image situation where you’re committed to the new firmware once the process starts. I do know for a fact that devices can get bricked from firmware updates gone wrong (routers and many common digital appliances and computer components, including those that run actual embedded OSs under the covers), so I am conservative about doing them at all. Whereas if I’m updating a real computer OS (and I don’t do that without good reason, either) I can create a restore point and/or a full backup, so there is a recovery path from even a worst-case scenario.

I agree that single image devices are definitely vulnerable to bricking during firmware updates. The issue is whether the manufacturer is willing to spend the money to essentially double the size of nonvolatile memory in order to protect against bricking during updates. If updates are expected to be very rare, they may well decide not to burden the product with the additional cost. My comments were specifically in response to Francis Vaughan’s assertion that even two-image updates have a window of vulnerability, which I do not believe is true.

:slight_smile: Typo or not, I may have to steal this.

I keep telling you people I’m a techno-peasant!

It depends on the design. Fully and automatically eliminating any window of vulnerability becomes a full-fledged problem in fault-tolerance engineering, not unlike the question of when to initiate failover in fault-tolerant system design, and in general such problems cannot be resolved by collaboration between software components alone. They generally require the intervention of a trusted hardware mediator, and it’s unlikely that ordinary firmware update schemes are going to be that sophisticated, so they will likely retain a finite if small window of vulnerability even with a two-image approach.

Of course, with a simple enough design the user himself could function as the mediator – the device could simply have a “reset” button that functions the same way as the “last known good” option in the original Windows NT OS. But again, I’ve yet to see anything like this in most devices. I’ve never had a device bricked, though perhaps only because I’m so reluctant to do firmware updates, but I’ve certainly read a lot of disgruntled commentary on Internet forums from folks who have. I really like my laptop and, trust me, the BIOS update was nerve-wracking!
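
To make the “last known good” idea concrete, the mediator logic in the boot code could be as simple as the sketch below. Every name in it is hypothetical; this is a sketch of the general technique, not anything I know to be shipping.

```c
/* Hypothetical "prove yourself or get rolled back" logic in the boot code.
 * nv_get/nv_set are assumed wrappers around a few bytes of nonvolatile
 * storage; MAX_TRIES, the key names, and the slot numbering are invented. */
#include <stdint.h>

#define MAX_TRIES 3u

extern uint32_t nv_get(const char *key);              /* assumed */
extern void     nv_set(const char *key, uint32_t v);  /* assumed */
extern void     jump_to_image(int slot);              /* assumed, never returns */

void boot_select_with_fallback(void)
{
    int pending = (int)nv_get("pending_slot");  /* slot just flashed, or -1 */
    int good    = (int)nv_get("good_slot");     /* last image known to work */

    if (pending >= 0) {
        uint32_t tries = nv_get("boot_tries");
        if (tries >= MAX_TRIES) {
            /* The new image never confirmed itself: give up on it. */
            nv_set("pending_slot", (uint32_t)-1);
        } else {
            nv_set("boot_tries", tries + 1u);
            jump_to_image(pending);  /* once healthy, the application sets
                                        good_slot to this slot and clears
                                        boot_tries, retiring the old image */
        }
    }
    jump_to_image(good);
}
```

Pair that with a hardware watchdog acting as the trusted mediator and even a new image that hangs before it can confirm itself gets dragged back into this logic on the next reset.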

I’m not sure why devices are designed to permit this possibility. Avoiding it does not require dual images or double the internal storage. Rather, it requires a primitive, unalterable boot loader that can re-load the firmware (actually software). This is how computers have always booted up – the bootstrap sequence begins with some low-level ROM logic that in turn loads some functional (but still small) routine that eventually loads code complex enough to do file system I/O.

My guess is that manufacturers of embedded devices (and even PC motherboards) began relying on near-full OS functionality for firmware updates, and if the OS or ROM BIOS ever becomes non-bootable (say, due to power failure during a firmware update), the device is “bricked”. However, this can only happen if they design an update process which requires high functionality to load a new image and provide no contingency or primitive reload capability via a restricted set of I/O devices.

It is probably a tradeoff of additional design time against what they view as a rare case. Another factor may be that older, more primitive, yet easy-to-program interfaces such as RS-232 have died out. RS-232 was easy to program with a simple assembler routine and did not require an OS. It was trivial to have an unalterable ROM sequence read a new image into memory.

USB requires more software logic which in this case would have to be in the ROM-based update routine. If the USB logic is part of an SoC (System On a Chip) the whole thing might require initialization just to reach the USB part.

For a device (say a camera) that uses an SD or micro-SD memory card, it seems a ROM-based update sequence could read a few sectors from a binary file on a freshly formatted card, which would then progress up the chain and load the new system image. Even if that image had to be placed on the SD card in specific sectors using some utility to simplify the update logic, at least the device firmware could always be reloaded.
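
And that “specialized utility” would not have to be much either. On a host PC it could be roughly the following; the device path, the fixed start sector, and the sector size are all invented for this sketch, and a real tool would want far more sanity checking.

```c
/* Hypothetical host-side utility for the "specially prepared SD card" idea:
 * copy a firmware image to fixed physical sectors of the card, with no file
 * system involved.  START_SECTOR is just whatever location the ROM-based
 * contingency loader would be wired to read. */
#include <stdio.h>

#define SECTOR_SIZE  512L
#define START_SECTOR 8L     /* invented fixed location the ROM would read */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s firmware.bin /dev/sdX\n", argv[0]);
        return 1;
    }

    FILE *img  = fopen(argv[1], "rb");
    FILE *card = fopen(argv[2], "r+b");   /* raw block device, opened directly */
    if (!img || !card) { perror("open"); return 1; }

    /* Seek to the agreed-upon physical location and copy the image. */
    if (fseek(card, START_SECTOR * SECTOR_SIZE, SEEK_SET) != 0) {
        perror("seek");
        return 1;
    }

    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, img)) > 0) {
        if (fwrite(buf, 1, n, card) != n) { perror("write"); return 1; }
    }

    fclose(img);
    fclose(card);
    return 0;
}
```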

However, for devices without an SD card, which must obtain the new image from a more complex interface such as WiFi, USB, etc., the boot loader would have to be much more capable. Maybe that explains it.

I once wrote software that, when debugged, was turned into firmware by burning into a ROM chip. I had a roommate who wrote microcode. His code was also burned into hardware, but it controlled what the CPU did when my software/firmware issued a machine-language instruction. I wrote ADD B and his microcode performed the task internally.

So yes, there is a difference.

Reload the firmware from where? Most embedded devices do have unalterable boot code, which boots the rest of the system from nonvolatile (but alterable) memory. A firmware update rewrites this alterable memory. If there’s only one copy, then corrupting that memory bricks the unit.

A fallback would be to allow the unalterable boot code to ALSO boot from removable storage (e.g. a USB stick) or from the network. This, especially the latter, is MUCH more complex than just booting from NVRAM which already resides in the unit. Instead of just reading from internal memory, you need device drivers and file system code to be able to read from a USB stick, or even more, a network stack that can do DHCP, TCP/IP, FTP/HTTP, etc. and read from a network. This is all vastly more complicated than the code that normally resides in the unalterable boot code, and increases the space requirements for the unalterable portion manyfold. And even if there were room to put all this extra code in there, it greatly increases the probability that there may be bugs in that code, which by definition cannot be updated since it’s in the unalterable area. Normally we like the unalterable boot code to be absolutely minimal and as simple as possible, since it can’t be updated.

A number of motherboards specifically advertise a “dual BIOS”, though the idea is not so much to ping-pong as to keep one known-good copy around and only flash the second. If the flash goes wrong, the board reverts to the backup.

It used to be that you could pull out the old chip and reprogram it externally. If you were brave, you could take an identical computer, pull out its BIOS while running, and then insert the dud and program it correctly. But they’re mostly soldered-on affairs these days.

Actually I once worked for a company that tried to do something like this – they shoehorned what was essentially a whole OS into the first level boot code, so that if the system failed to boot, the boot code would automatically download and install a new image. I strongly advised against this, as I thought it was too risky to put this much code in the first stage loader. My objections were noted and ignored. Within a week after releasing this, we had a dozen reports of bricked units. They gave the units to me for analysis. I was able to determine that the cause was a bug in the loader code, which incorrectly detected a good boot as a failed one, and then due to a second bug, corrupted the (valid) system image instead of updating it. Although I knew this approach was dangerous, I did not expect the results to be so immediate and so catastrophic.

To me, a key distinction between firmware and software is that firmware should live in addressable memory. I.e., it needn’t be copied from a disk or somewhere into memory and then run. But, like I said, “should”. So it’s a very vague term.

For a lot of Android devices, an OS update is done by booting into a limited version of the OS, typically called the boot loader, though that name is misleading as well. This version is separate from the OS that normally runs. It verifies the new OS is good using signature verification (don’t want users running a modded OS, do you?), unpacks the firmware file, and stuffs the new files/directories into the right places.

If the “boot loader” itself needs to be updated (for example, to block an exploit, or because it can’t handle some detail of upgrading the OS), then the regular OS does all the verification, unpacking, and copying for that other OS.

TiVos are an example of a system that keeps two full copies. The bulk of the OS is stored twice, on two separate partitions. During an upgrade, the new OS is put on the currently unused OS partition, a flag is set to mark it as the active OS partition, and on the next boot the box uses the new OS. If the boot OS notices something’s wrong with the current OS partition (e.g. the user tried to modify the OS), then the other OS partition is used. If both are bad … oh well.

There is a basic boot system on TiVos (greatly expanded on the latest models) which verifies signatures of key OS files to prevent user tampering. As the regular OS boots, it does even more checking of a lot more files before going about its normal business.

So on TiVos it is this basic boot OS that’s called the firmware, and as far as I know TiVo doesn’t issue upgrades for it. Someone used to provide EPROM substitute chips for early models so you could solder (!) one in and avoid the file check stuff. Then you could mod the regular OS to your needs. But Series 4 models have secure EPROMs, so no one ever got a copy of the original to duplicate.

True. But what defines “strong”? I’m being very picky* here, but there is no known mechanism to reduce the risk to zero. Just using the word “strong” acknowledges the weakness. A weak checksum algorithm leaves open the possibility of false matches. A strong one merely reduces the chance of a false match to something so infinitesimal that the sun will be cold before we are likely to have a problem.

Challis’ algorithm is another way of coping with two-phase updates (rough sketch below).

• Insanely so, in fact. But this is a theoretical question, not a practical one.
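
For anyone who hasn’t run across it: as I recall the scheme, each copy of a record carries the same sequence number at its head and its tail, and the tail is written last, so an interrupted write is self-evident and the other copy still stands. A rough sketch, with everything (names, layout, the nv_write helper) invented for this post:

```c
/* Sketch of what I understand Challis' scheme to be: two alternating copies
 * of a record, each carrying the same sequence number at its head and tail,
 * with the tail written last.  A torn write leaves mismatched ends, so that
 * copy is simply ignored. */
#include <stdint.h>

#define PAYLOAD_BYTES 56u

typedef struct {
    uint32_t seq_head;
    uint8_t  payload[PAYLOAD_BYTES];
    uint32_t seq_tail;               /* must match seq_head to be valid */
} record_t;

extern record_t slots[2];            /* two copies in nonvolatile memory (assumed) */
extern void nv_write(void *dst, const void *src, unsigned len);  /* assumed, ordered */

static int valid(const record_t *r) { return r->seq_head == r->seq_tail; }

/* Read: the newest valid copy wins. */
const record_t *read_record(void)
{
    int a = valid(&slots[0]), b = valid(&slots[1]);
    if (a && b)
        return slots[0].seq_head > slots[1].seq_head ? &slots[0] : &slots[1];
    if (a) return &slots[0];
    if (b) return &slots[1];
    return 0;
}

/* Update: overwrite the OLDER copy; head first, payload next, tail last. */
void write_record(const uint8_t payload[PAYLOAD_BYTES])
{
    const record_t *cur = read_record();
    uint32_t seq = cur ? cur->seq_head + 1u : 1u;
    record_t *dst = (cur == &slots[0]) ? &slots[1] : &slots[0];

    nv_write(&dst->seq_head, &seq, sizeof seq);
    nv_write(dst->payload, payload, PAYLOAD_BYTES);
    nv_write(&dst->seq_tail, &seq, sizeof seq);   /* commit point */
}
```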

:confused: “Failover”?

When the active part of a multiply redundant system fails, the system automatically hands over control to a backup component. Hence “failover”.

This problem is also a two generals problem.

An asteroid could smash through your roof, hitting your computer and messing up your firmware update. Nothing is zero risk.

Traditional checksums are not that great, but modern hashes have collision probabilities far lower than those of even very improbable physical events.

Really, the main question is whether you need cryptographic strength or not. Which is to say, do you need to protect against a determined attacker? If not, and you’re just subject to random faults, then a hash like MD5 is sufficient. If you do need such protection (or simply want to prevent end users from loading unauthorized firmware), then you need something like SHA-1.

Too strong a hash could actually give you worse results: it would take longer to compute, giving more opportunity for a cosmic ray to corrupt the image. If the probability of a collision is already infinitesimal compared to random faults, then switching to a stronger but slower hash would actually increase the failure rate.

Congratulations, you answered your own question.

Yes, as I already described, although the degree of complexity varies with the supported media types. As I also noted, it could range from trivially simple (RS-232, which unfortunately is no longer a practical option) to very complex in the case of WiFi or network sources.

Yes, that is a valid concern – IF the design used “all this extra code”. How much code is required depends entirely on what types of devices, and how many of them, the loader must support as image sources. I have consumer devices sitting on my desk right now which were manufactured in the last two years and only have an RS-232 interface for doing firmware updates. I can assure you as a designer of such devices that it does not take “all this extra code” – it is a tiny, almost trivial amount of code in this case.

While the RS-232 interface is essentially obsolete, there are still embedded devices being sold today which use it. I mention this not to advocate it as a solution but simply to illustrate that a boot loader does not unavoidably require a huge amount of code.

Yes, if you want the flexibility of loading an image from USB plus a TCP/IP network plus WiFi plus Bluetooth, etc., that will take essentially an OS in the boot code, which incurs significant development cost and risk. However, that is not always necessary. All the boot code need do is initialize a local I/O controller, read a single sector or data block from a fixed location, and then execute it. The code in that sector then incrementally enables larger reads, activates more I/O devices, and so on. That is how a bootstrap sequence works.
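
As a sketch of how little the unalterable part needs to be, assuming the ROM only knows how to initialize one controller and read one fixed block (every name and address below is hypothetical):

```c
/* Hypothetical first-stage boot ROM: initialize one local I/O controller,
 * pull a single block from a fixed location into RAM, and jump to it.
 * Everything beyond this point lives in the (replaceable) second stage. */
#include <stdint.h>

#define STAGE2_LOAD_ADDR ((void *)0x20000000u)  /* invented RAM address */
#define STAGE2_BLOCK     0u                     /* fixed block number   */

extern void storage_ctrl_init(void);                      /* assumed */
extern int  storage_read_block(uint32_t blk, void *dst);  /* assumed */

void rom_boot(void)
{
    storage_ctrl_init();

    if (storage_read_block(STAGE2_BLOCK, STAGE2_LOAD_ADDR) == 0) {
        /* The second stage knows how to bring up more devices, parse a
         * file system, verify the main image, and so on. */
        void (*stage2)(void) = (void (*)(void))STAGE2_LOAD_ADDR;
        stage2();
    }

    /* Read failed: spin (a real ROM might blink an LED or retry). */
    for (;;) { }
}
```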

Some so-called “bricked” devices may already have boot code which does this but requires a specific, highly limited data pattern on a single I/O device. E.g., a “bricked” camera with an SD card could have a very primitive ROM-based contingency boot option which simply reads the first addressable sector off a specially prepared SD card. It would not even need a file allocation table – it’s a physically addressed read. Putting that on the SD card would require a specialized utility, but it’s a contingency only. For some devices this might already be the procedure at a service depot, rather than tearing the device apart. It is simply not exposed to end users.