If computer memory is subject to random bit flips from cosmic radiation and the like, how come that doesn’t manifest as random display glitches – letters changing here and there, random crashes when you’re not doing anything, etc.? If generic computers with non-error-correcting memory can have uptimes measured in months and years, does that mean they’re already dealing with the memory errors at a higher level, or are they just silently failing in subtle ways and not being noticed?
In other words, if ECC is supposed to help with random memory fluctuations, how do computers function WITHOUT it? Wouldn’t random memory errors be like randomly rewriting computer code… and yet somehow it continues to work?
Does this mean that if I have a large spreadsheet open and leave it open for a few days in the background, there’s a chance some of the numbers will be randomly changed, and if I didn’t notice it, they’d be saved back into the file?
They do. However, if you think about it, the majority of the memory space in your computer is not storing information where a bit flip is going to hurt anything. For any stored floating point number, a flip in a low-order bit changes the value by a negligible amount. For any stored integer that is plain data, where the bits aren’t used for later branching operations, a bit flip gives you a wrong number but won’t crash anything. All those graphics on your screen? Most of them are stored in RAM as various bitmaps and compressed image files, and a single bit flip there may not cause any error you can notice.
Out of all the memory in your computer, the bits where a change actually causes a crash are probably less than 1% of the total.
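To make the floating point case concrete, here is a minimal Python sketch. The helper name `flip_bit` and the value `1234.56` are made up for illustration, and it assumes the usual 64-bit IEEE 754 double. It also shows the other side of the coin: a flip in a high exponent bit is anything but harmless, it just happens to hit a much smaller share of the bits.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its 64-bit IEEE 754 encoding flipped.

    bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the requested bit and reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped

price = 1234.56                 # hypothetical spreadsheet cell
low = flip_bit(price, 0)        # lowest mantissa bit: tiny change
high = flip_bit(price, 62)      # top exponent bit: wildly different value

print("low-order flip changes value by", low - price)
print("high-order flip turns value into", high)
```

So whether a flipped spreadsheet number is noticeable depends heavily on which bit got hit.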
In addition, most consumer desktop users do experience crashes fairly routinely. I’ve had portions of Windows crash (explorer.exe has quit working), I’ve had almost every application I’ve ever run crash at least once, and so on. How many of those crashes are due to bit flips? Probably at least a few, though almost all of them are due to programming errors.
Anyways, server operators have an economic reason to want to eliminate these crashes, and many servers are running various flavors of Linux kernels that are basically near-perfect pieces of software (there are bugs, but they are incredibly rare). They are running well-tested server applications that also do not crash very often.
just to add onto what has already been said, computer memory can store different things: data, code, pointers, and probably other stuff I’m not remembering.
data is the stuff you (the user) are working with. JPEG images, Word documents, music files, etc. a bit flip in any of these might present as some form of corruption if you see it at all.
code consists of instructions for the CPU to execute. a bit flip in a code page could cause the CPU to receive an invalid instruction which can crash the running program or cause the OS to halt (blue screen, or kernel panic/oops in *nix land.)
pointers are values which tell the CPU where to look in memory for something. if a pointer gets corrupted, it could tell the process to try to access a memory address it is prohibited from touching, which will most likely crash the program. if it happens in kernel-mode (e.g. a hardware driver) the OS will likely crash unless the driver can be restarted w/o halting the OS.
Ars Technica recently had an article showing the effect of a single bit error on a JPEG. It was in the context of the flip happening on a filesystem, as opposed to memory, so it’s a little different (an image is not likely to be as highly compressed in memory as it is on disk), but it was the only other context in which I’ve seen someone try to categorize the effects of a single bit flip.
While nothing above is wrong, the answer can be heavily simplified. The chance of memory errors that actually affect the system is quite small, only increasing the longer you leave your computer on. Most users don’t leave their computers on long enough for it to be a significant problem, and, when a problem does happen, they notice right away, and thus can restart the computer.
Systems with ECC are on for a much longer time and are expected to keep running without human intervention. ECC is used for reasons similar to those for using RAID (redundant arrays of drives that keep working when a single drive fails), which isn’t used much on home computers.
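For anyone curious how ECC can fix a flipped bit at all, here is a toy Python sketch using a Hamming(7,4) code. This is only an illustration of the principle. Real ECC memory uses wider codes over 64-bit words and does the work in hardware, and the function names below are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Correct at most one flipped bit.

    Returns (data_bits, position), where position is the 1-based index of
    the bit that was repaired, or 0 if no error was detected.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1         # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                       # simulate a cosmic-ray flip in "memory"
recovered, position = hamming74_decode(stored)
print(recovered == word, "repaired position:", position)
```

The cost is the extra parity bits, which is part of why ECC modules carry additional memory chips compared with non-ECC ones.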