I am interested in hearing about what precautions are taken to ensure random bit flips don’t cause critical functionality to stop operating correctly, perhaps due to an SEU or some other, for all intents and purposes, random event. In particular, consider a safety-critical application such as jet engine or life support software where:
bool applicationShutdown;
.
.
.
/* UH OH! Random bit flip */
if (applicationShutdown == TRUE)
{
    shutdownApplication();
}
Now I have heard that MISRA standards recommend (require?) that things such as booleans be defined with TRUE and FALSE encoded as bit patterns that are far apart, along the lines of the sketch below (the exact values and the handleFault() call are only illustrative):
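/* Exact values are only illustrative; coding standards differ on the encoding */
#define BOOL_FALSE 0x55u   /* 01010101 */
#define BOOL_TRUE  0xAAu   /* 10101010 */

/* Any single bit flip yields a value that is neither TRUE nor FALSE,
   so it can be treated as a fault rather than silently read as the
   opposite value: */
if (applicationShutdown == BOOL_TRUE)
{
    shutdownApplication();
}
else if (applicationShutdown != BOOL_FALSE)
{
    handleFault();   /* hypothetical fault handler */
}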
This way a single bit flip can’t inadvertently toggle the value from true to false. That would work for booleans, but what about an integer representing speed?
I am not so concerned about this as it relates to external sensors, since presumably there are multiple sensors communicating with your system and some kind of voting scheme is employed to weed out bad readings. I am interested in the situation where the value is stored in RAM and that RAM suddenly gets corrupted.
Can anyone speak to the various standards and industry best practices? I have heard of EDAC-protected memory.
I am not sure I understand how this works. “Neighboring bits belong to different words” is confusing me. Suppose one had two words named A[31…0] and B[31…0]. Are they suggesting the memory would look like:
[A31][B31][A30][B30] . . .[A0][B0]
How much slower would this type of memory be? Can anyone provide any historical perspective on how this used to be handled? Any input would be greatly appreciated.
There are many safeguards:
[ul]
[li]There is error-detecting and/or -correcting memory, for one; I believe you already touched on that.[/li]
[li]For bits sent over the network, there are error-detecting and -correcting protocols at various levels. All modems and network cards in recent years are error-correcting, for example; higher-level data transfer protocols also implement measures such as parity bits, checksums, CRCs, etc. (there is a toy checksum sketch just after this list).[/li]
[li]Data is often checked after network transmission in an entirely separate step by computing a secure hash or verifying a digital signature. (If you’ve ever downloaded software from the web, you may have noticed that the page contains a separate link to an MD5 sum, SHA-1 sum, or PGP signature so you can verify the integrity of the download.)[/li]
[li]Back to purely program-internal information: critical operations in interactive software are typically safeguarded by a confirmation prompt. So even if the program mistakenly believed it had received a shutdown command, all it’s going to do is pop up a dialog (or some equivalent prompt) asking “Are you sure?”[/li]
[/ul]
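To make the checksum idea concrete, here is a toy sketch in C (a simple additive sum, purely illustrative; real protocols use CRCs or cryptographic hashes, which catch far more error patterns):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy additive checksum over a buffer */
static uint16_t simple_checksum(const uint8_t *buf, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Receiver side: recompute and compare against the checksum that was
   transmitted along with the message; on mismatch, discard the message
   or request a resend. */
static bool message_ok(const uint8_t *buf, size_t len, uint16_t sent_sum)
{
    return simple_checksum(buf, len) == sent_sum;
}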
In the old days, memory was protected by a parity bit, which could detect (though not correct) single-bit errors. These days there are more sophisticated means, but they serve the same purpose.
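Roughly, the parity scheme looks like this (a minimal sketch, assuming even parity):

#include <stdbool.h>
#include <stdint.h>

/* One extra bit stores the XOR of all data bits. On every read the
   parity is recomputed and compared with the stored bit; a mismatch
   means at least one bit flipped since the write. It detects any single
   bit error but cannot locate or correct it. */
static bool parity_of(uint8_t byte)
{
    bool p = false;
    for (int i = 0; i < 8; i++)
        p ^= (byte >> i) & 1;
    return p;   /* stored alongside the byte at write time */
}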
Don’t worry though, morons like me who wrote software with bugs are a much worse threat.
At one place I worked we had the misfortune of owning two Sun servers (a 4500 and a 3500) that crashed quite a bit due to memory errors. My understanding is that they did not have ECC memory (or the cache wasn’t ECC, I’m not sure). Anyway, those bit flips do cause problems.
On mainframes, the CPU will detect errors and roll back transactions at the hardware level to make sure the state of the system is not in error.
It seems that most bits that get flipped, statistically speaking, are in pointers, so the likely result of a flipped bit is software trying to dereference an invalid pointer. An MMU can catch that, and an OS (or other software) can perform a post-mortem in some reasonably sane fashion.
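For a rough idea of what catching it and doing a post-mortem can look like from user space, here is a minimal sketch (a real system would dump far more state, or hand off to a supervisor that restarts the process):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/* The MMU fault on a bad dereference is delivered to the process by the
   OS as SIGSEGV; this handler just notes it and aborts so a core dump is
   left for offline analysis. (Strictly speaking only async-signal-safe
   calls belong in a handler; fprintf here is for illustration.) */
static void segv_handler(int sig)
{
    fprintf(stderr, "caught signal %d: bad pointer dereference\n", sig);
    abort();
}

int main(void)
{
    signal(SIGSEGV, segv_handler);

    int *p = (int *)0x1;   /* pretend a bit flip corrupted this pointer */
    return *p;             /* MMU fault -> SIGSEGV -> handler */
}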
I don’t know much about hardware (but I am a professional programmer), and basically, if you’ve got memory that can detect (but not necessarily repair) corruption, it’s relatively straightforward to signal that to the operating system/software using hardware interrupts. Any transaction could then be canceled - or more probably, just not committed, because the corrupted process would get killed - meaning that any intermediate state is just “forgotten” and you end up with some known good state earlier in time. This could still cause problems (for instance, it might require manual intervention to re-run the canceled transactions), and if the memory corruption is continuous (i.e. you’re not dealing with “normal” radiation), it won’t solve the problem at all, but under “normal” conditions, it should prevent the total system/data from becoming irreversibly corrupted.
The problem then becomes: what to do next? In most cases it’s probably simplest to just automatically re-run the pending transactions and continue, but in potentially disastrous conditions, you might want to shut down completely and force manual intervention.
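In code, the “only commit known-good state” idea can be as simple as this sketch (validate() stands in for whatever consistency check the application has; the names are made up for illustration):

#include <stdbool.h>

struct state {
    int speed;
    /* ... more application state ... */
};

/* Placeholder consistency check over the whole state */
static bool validate(const struct state *s)
{
    (void)s;
    return true;   /* real checks (ranges, checksums, invariants) go here */
}

static void apply_update(struct state *authoritative, int new_speed)
{
    struct state scratch = *authoritative;   /* work on a copy */
    scratch.speed = new_speed;

    if (validate(&scratch))
        *authoritative = scratch;   /* commit the transaction */
    /* else: discard scratch; the last known-good state is untouched */
}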
Read the rest of my post: Back in the old days of Linux, GCC would sometimes crash when building the kernel. The cause was that GCC was dereferencing bad pointers. The underlying cause was that the RAM in those old PCs was bad and randomly flipping bits. If most flipped bits were something other than the contents of pointers, other errors would have been apparent.
I agree. I can’t imagine pointers take up lots of memory except for the most trivial calculations. The real problem is when you get bit flips in words that won’t cause your system or program to crash.
There is tons of literature on this very subject. Memories have lots of ECC (error-correcting codes). There are several ways of doing this for logic. Besides simple voting, there are ways of retrying a calculation if an error is detected. There have been some papers on adding logic to the flip-flops in a design to detect and correct bit flips in logic, though I have no evidence this has been a problem - yet. The reason logic isn’t a big problem is that a bit flip will be captured only if it happens on a critical node just before a capture clock. Most of the time bit flips won’t affect anything.
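For a feel of how an ECC actually corrects a flipped bit, here is a minimal textbook Hamming(7,4) sketch (real EDAC memory uses wider SECDED codes, e.g. 72 bits protecting 64, but the principle is the same):

#include <stdint.h>
#include <stdio.h>

/* Encode 4 data bits into a 7-bit codeword.
   Codeword positions (1-based): 1=p1, 2=p2, 3=d0, 4=p3, 5=d1, 6=d2, 7=d3 */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d0 = (data >> 0) & 1, d1 = (data >> 1) & 1;
    uint8_t d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;   /* covers positions 1,3,5,7 */
    uint8_t p2 = d0 ^ d2 ^ d3;   /* covers positions 2,3,6,7 */
    uint8_t p3 = d1 ^ d2 ^ d3;   /* covers positions 4,5,6,7 */
    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p3 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decode, correcting a single flipped bit if present. */
static uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t b[8];
    for (int i = 1; i <= 7; i++)
        b[i] = (cw >> (i - 1)) & 1;
    int syndrome = (b[1] ^ b[3] ^ b[5] ^ b[7])
                 | ((b[2] ^ b[3] ^ b[6] ^ b[7]) << 1)
                 | ((b[4] ^ b[5] ^ b[6] ^ b[7]) << 2);
    if (syndrome != 0)
        b[syndrome] ^= 1;        /* the syndrome is the position of the bad bit */
    return (uint8_t)(b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3));
}

int main(void)
{
    uint8_t cw = hamming74_encode(0xB);               /* encode the nibble 1011 */
    cw ^= (uint8_t)(1u << 4);                         /* simulate an SEU on codeword bit 5 */
    printf("decoded: 0x%X\n", hamming74_decode(cw));  /* prints 0xB */
    return 0;
}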
Mostly this gets handled in hardware, though software can do it by doing the calculation several different ways or repeating it.
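A simple software version of the voting idea looks like this (a sketch; compute_speed() is a made-up stand-in for whatever calculation is being protected):

#include <stdint.h>

/* Bitwise 2-of-3 majority vote: a single corrupted copy is outvoted. */
static uint32_t majority3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Usage: run the calculation three times (ideally on separate copies of
   the inputs, or via independent code paths) and vote on the results:
   uint32_t speed = majority3(compute_speed(), compute_speed(), compute_speed());
*/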
Sources include the Fault Tolerant Computing Symposium (FTCS), which has been going on for years. There is a Center for Reliable Computing at Stanford with lots of papers and presentations online. They sent an off-the-shelf computer into orbit, into a region with lots of cosmic radiation and thus lots of bit flips, to experiment with the techniques I mentioned.
Since I don’t deal with highly reliable software, only hardware, I don’t know what actually is getting used.
Thanks to everyone who responded. This is very good information. So I now understand that this is primarily handled in the HW and, of course, it’s the SW’s responsibility to do some kind of fault accommodation if necessary.
It’s more likely that the bad RAM was corrupting all kinds of data, but GCC would continue merrily along without noticing until it dereferenced a corrupted pointer.
Edit: That being said:
You’d be surprised. My company just switched to a 64-bit OS and we’re finding that our programs are taking as much as 50% more memory due to the switch. Now, a part of that will be due to the stricter alignment constraints on a 64-bit system, but you’ll find that you use a lot of pointers in serious applications. The abstractions that you deal with can make it easy to forget this.
Software can continuously check all inputs to all functions, similar to what you’d use for debugging (asserts, contracts, etc.). Even better is to run several copies of the software, and check that each behaves identically. For critical paths shared between your copies, you could use tricks like embedding checksums into your integers.
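One concrete version of the “checksums in your integers” trick is to store a critical value alongside its bitwise complement and verify the pair on every read (a sketch; the names are made up):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t value;
    uint32_t value_inv;   /* always kept equal to ~value */
} guarded_u32;

static void guarded_set(guarded_u32 *g, uint32_t v)
{
    g->value = v;
    g->value_inv = ~v;
}

/* Returns false if the pair is inconsistent, i.e. some bit flipped in
   either copy since the last guarded_set(). */
static bool guarded_get(const guarded_u32 *g, uint32_t *out)
{
    if ((g->value ^ g->value_inv) != 0xFFFFFFFFu)
        return false;   /* corruption detected; caller handles the fault */
    *out = g->value;
    return true;
}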
The organizations referred to by Voyager have probably thought of many such tricks, and evaluated their practicality and value.
The vast majority of programmers, however, just rely on hardware mechanisms. (ECC in RAM and even in caches.) Of course, these are limited. They reduce the incidence of errors dramatically, but do not stop them. But for most software, infrequent errors aren’t as problematic as many fear.
P.S. Most bits that get flipped, statistically speaking, are in empty or inconsequential regions.
The problem with any software-based approach is that the bit-flips can occur in the bytes which make up the program itself. Or the OS itself.
No matter how error-tight you make the SW, or even if you have formal proofs of its error tightness, that only applies to errors outside the SW itself.
So you really need to prevent bit flips down at the hardware level. If the RAM cannot return erroneous bytes, nor the whole RAM fetch pipeline flip a bit, nor the processor make a mistake internally, then you can have a fault-proof system.
At the arm-waving level of precision … each layer of abstraction only adds opportunity for error; it cannot reduce it. A layer can try to check for inconsistencies in data at its own level. A layer has a very hard time checking for anything at a lower level. Ultimately you end up trusting that the lower level is telling you the truth.
If the software starts misbehaving, this will likely show up as inconsistencies in its data or working state (which runtime checks can catch). Or else it crashes, of course.
You cannot make RAM that will never yield bit errors (and no one bothers to make RAM that can correct more than one error per 64 bits).
A layer can check for inconsistencies at a lower level and doesn’t have to just trust it (network protocols do this all the time). What’s true is that it can’t do checks on behalf of the layer above it.
I think it depends on the application. Clearly a bit flip in a word that is uninitialized won’t matter, and in most cases that will be most of your memory. A bit flip in a word that has already been read and will be overwritten won’t matter either. On the other hand, a bit flip in a small, heavily used cache will get you if not corrected.
However, if you have a high error rate, you’ll soon find out - if you’re lucky.