OK, for the purposes of this thread, I don’t care if what I’m asking is impractical or that RAM is cheap. I’m just curious whether this is possible.
DOS and Windows (and I’m assuming Linux and Mac OS) can scan a hard drive and mark bad sectors so that they won’t be used.
Could the same thing be done for RAM? Obviously the RAM itself couldn’t be marked; the bad RAM location(s) would have to be put in a file. But at any rate, if you had a bad RAM stick with one or more specific addresses that are bad, could an OS work around them?
Sure.
It’s been done in the past, and may still be done in fault-tolerant systems.
But RAM is cheap these days, and it’s usually considered a better idea to throw away an entire stick if part of it is bad.
Typically, the defect mapping is done by the memory system, and not the OS. An example.
Thanks for the answers. I know about error detection and correction; I was just thinking of motherboards that don’t take parity or ECC RAM, or anything like that.
Since the mapping from virtual addresses to physical addresses is under the control of the OS kernel, the answer is basically yes. The OS can keep a table of bad physical pages and simply never allocate those pages. There is an issue for low physical memory addresses, because the kernel boots in physical address mode, and bad memory in the area needed for early bootstrap can’t be avoided in this manner. But the rest clearly could. In principle.
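In practice that bad-page table is just a list the physical page allocator consults before handing out a frame. Here’s a minimal C sketch of the idea, with a toy bump allocator and made-up names (bad_pfns, alloc_pfn), not anything from a real kernel:

```c
/* Illustrative sketch only: a toy physical-page allocator that skips
 * frames listed as bad. All names here are invented for the example. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BAD 64

static uint64_t bad_pfns[MAX_BAD];   /* page frame numbers known to be bad */
static size_t   n_bad;

static bool pfn_is_bad(uint64_t pfn)
{
    for (size_t i = 0; i < n_bad; i++)
        if (bad_pfns[i] == pfn)
            return true;
    return false;
}

static uint64_t next_free_pfn;       /* simple bump allocator for the sketch */

/* Hand out the next free physical frame, never returning a bad one. */
static uint64_t alloc_pfn(void)
{
    while (pfn_is_bad(next_free_pfn))
        next_free_pfn++;             /* bad frames are simply never mapped */
    return next_free_pfn++;
}

int main(void)
{
    bad_pfns[n_bad++] = 2;           /* pretend frame 2 failed a memory test */
    for (int i = 0; i < 4; i++)
        printf("allocated frame %llu\n", (unsigned long long)alloc_pfn());
    return 0;                        /* prints 0, 1, 3, 4 */
}
```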
However, detecting bad memory isn’t trivial. ECC can flag correctable and uncorrectable errors; parity memory can also flag single-bit errors, but not correct them, and it can’t see all double-bit errors. If the OS gets an uncorrectable error there is little it can do except assume that the process that incurred the error is now toast. If it was the kernel, well, a panic (or BSOD) is about the only reasonable (if you can call it that) behaviour. So a system that is in some sense tolerant of memory errors (by trying to map out bad memory) is probably going to see a lot more hard failures too. Mapping out pages with persistent single-bit errors in an ECC system might be a reasonable compromise.
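The resulting policy looks roughly like the following. This is an illustrative sketch with invented types and stub functions, not any real kernel’s machine-check handling:

```c
/* Illustrative only: how an OS might react to a reported memory error.
 * All types, names, and stubs are invented for the example. */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

enum severity { CORRECTABLE, UNCORRECTABLE };

struct mem_error {
    enum severity sev;
    uint64_t      pfn;        /* physical frame the error was reported in */
    bool          in_kernel;  /* kernel context, or a user process? */
};

/* Stubs standing in for real kernel machinery. */
static void retire_page(uint64_t pfn)   { printf("retiring frame %llu\n", (unsigned long long)pfn); }
static void log_corrected(uint64_t pfn) { printf("corrected error in frame %llu\n", (unsigned long long)pfn); }

static void handle_mem_error(const struct mem_error *e)
{
    if (e->sev == CORRECTABLE) {
        /* ECC already fixed the data; just record it. Retiring the frame
         * only if it keeps faulting is the "reasonable compromise" above. */
        log_corrected(e->pfn);
        return;
    }
    /* Uncorrectable: whatever was using that memory is toast. */
    retire_page(e->pfn);
    if (e->in_kernel)
        abort();              /* stand-in for a panic/BSOD */
    else
        printf("killing the process that hit frame %llu\n",
               (unsigned long long)e->pfn);
}

int main(void)
{
    struct mem_error e = { UNCORRECTABLE, 1234, false };
    handle_mem_error(&e);
    return 0;
}
```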
Properly testing memory is appallingly difficult. Almost no-one ever does a real memory function test, as they take days. Memory failures can be very subtle: leaky cells that forget over a period of hours, and worse. So it is going to be difficult to distinguish truly failing memory addresses from random one-off failures.
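To see why a quick pass proves so little, here is a toy pattern test in C over an ordinary malloc’d buffer (a stand-in for raw physical memory, ignoring caches and the compiler optimising the loops). It will catch stuck-at bits, but a leaky cell that only forgets over hours sails straight through unless you insert that hours-long wait where the comment says:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

/* Write a pattern, read it back, then repeat with the complement. This
 * catches simple stuck-at faults; retention faults need a long delay
 * between the write pass and the read pass, which is part of why real
 * memory tests take so long. */
static size_t test_region(uint64_t *buf, size_t words, uint64_t pattern)
{
    size_t bad = 0;
    for (size_t i = 0; i < words; i++) buf[i] = pattern;
    /* A retention test would wait minutes-to-hours right here. */
    for (size_t i = 0; i < words; i++)
        if (buf[i] != pattern) bad++;
    for (size_t i = 0; i < words; i++) buf[i] = ~pattern;
    for (size_t i = 0; i < words; i++)
        if (buf[i] != ~pattern) bad++;
    return bad;
}

int main(void)
{
    size_t words = 1 << 20;                       /* 8 MiB test region */
    uint64_t *buf = malloc(words * sizeof *buf);
    if (!buf) return 1;
    printf("%zu bad reads\n", test_region(buf, words, 0xAAAAAAAAAAAAAAAAull));
    free(buf);
    return 0;
}
```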
Mapping out the bad memory in the OS has the advantage that you don’t add any overhead to the memory internals - since you are already performing an address translation for virtual memory, you have already paid the price.
Bad caches, I agree. But for bad embedded memory, doing a software mapping would be absurdly expensive.
Testing memory functionally is tough, but pretty much all processors use Built-In Self Test (BIST) for embedded memories - and there are also tools that put the BIST hardware on an ASIC and use the same technique to test external memories. The BIST hardware sits right next to the RAM and tests it at speed, which is very important. In the old days the BIST would be set up with an algorithm to test for the defects most likely to be seen in a given memory design, but we discovered that nature always inserted defects that we didn’t bother to test for, so many BIST engines are now programmable.
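“Programmable” here means the test algorithm lives in a little table rather than in hardwired logic. March C- is a standard published march test; the sketch below just runs it in software over an array to show the shape of the thing. The structs are invented for the example, and real BIST does this in hardware, at speed, right next to the RAM:

```c
/* Sketch of a march test expressed as data rather than fixed logic. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

enum op { W0, W1, R0, R1 };              /* write 0/1, read expecting 0/1 */

struct march_element {
    int           dir;                   /* +1 = ascending, -1 = descending */
    const enum op *ops;
    size_t        n_ops;
};

/* March C-: {^(w0); ^(r0,w1); ^(r1,w0); v(r0,w1); v(r1,w0); ^(r0)} */
static const enum op e0[] = {W0},     e1[] = {R0, W1}, e2[] = {R1, W0},
                     e3[] = {R0, W1}, e4[] = {R1, W0}, e5[] = {R0};
static const struct march_element march_c_minus[] = {
    {+1, e0, 1}, {+1, e1, 2}, {+1, e2, 2}, {-1, e3, 2}, {-1, e4, 2}, {+1, e5, 1},
};

static size_t run_march(uint8_t *mem, size_t n)
{
    size_t fails = 0;
    for (size_t e = 0; e < sizeof march_c_minus / sizeof *march_c_minus; e++) {
        const struct march_element *m = &march_c_minus[e];
        for (size_t k = 0; k < n; k++) {
            size_t a = (m->dir > 0) ? k : n - 1 - k;   /* address order matters */
            for (size_t j = 0; j < m->n_ops; j++) {
                switch (m->ops[j]) {
                case W0: mem[a] = 0; break;
                case W1: mem[a] = 1; break;
                case R0: if (mem[a] != 0) fails++; break;
                case R1: if (mem[a] != 1) fails++; break;
                }
            }
        }
    }
    return fails;
}

int main(void)
{
    uint8_t mem[256] = {0};                /* pretend memory under test */
    printf("fails: %zu\n", run_march(mem, sizeof mem));
    return 0;
}
```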
Memories are so big these days that their failure rates during production are very high. To improve yields nearly everyone designs them with redundant rows and columns. During manufacturing test, you run BIST on the memory, find the failing bits, and blow efuses to do the substitution, using several clever algorithms.
If you really want fault tolerance, this can be extended to Built-In Self Repair, where on power-up you run BIST, find the defective cells, and use soft fuses or something similar to do the repair. This doesn’t hurt performance, and the hardware overhead doesn’t matter, since you gain far more in yield from the repairs than you lose to the repair logic.
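Either way, the repair itself amounts to a row (or column) substitution in the decoder. A rough C model of the row-decode side, with invented names, works the same whether the remap registers are loaded from blown efuses at the factory or from a power-up BIST run:

```c
/* Sketch of row redundancy: the row decoder compares the incoming row
 * address against a small set of remap registers and steers matching
 * accesses to a spare row instead. Names are invented for the example. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define N_SPARES 2

struct remap {
    bool     valid;       /* fuse blown / repair programmed */
    uint32_t bad_row;     /* row found failing by BIST */
};

static struct remap spares[N_SPARES];

/* Returns the physical row actually driven for a requested row address. */
static uint32_t decode_row(uint32_t row, uint32_t first_spare_row)
{
    for (int i = 0; i < N_SPARES; i++)
        if (spares[i].valid && spares[i].bad_row == row)
            return first_spare_row + i;   /* steer to the spare row */
    return row;                           /* normal case: no repair hit */
}

int main(void)
{
    spares[0] = (struct remap){ true, 37 };          /* BIST found row 37 bad */
    printf("row 36 -> %u\n", decode_row(36, 1024));  /* unchanged */
    printf("row 37 -> %u\n", decode_row(37, 1024));  /* remapped to 1024 */
    return 0;
}
```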
The OpenSPARC definition has ways of marking bits as dirty, as a way of implementing mapping for faulty caches. I was on a committee looking at this a few years ago, but I think I tossed my documentation.
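The general idea is easy to picture even without the SPARC documentation: each cache way carries a flag saying “don’t use me”, and the victim-selection logic simply never allocates into a way so marked. A toy sketch with an invented structure, not the actual OpenSPARC scheme:

```c
/* Toy sketch of mapping out faulty cache ways: each way in a set carries
 * a "faulty" flag, and victim selection skips flagged ways entirely. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct way {
    bool     valid;
    bool     faulty;    /* set by BIST, or after repeated errors */
    uint64_t tag;
};

struct set { struct way way[WAYS]; };

/* Pick a way to allocate into, skipping faulty ways. Returns -1 if the
 * whole set has been mapped out (the access then just misses to memory). */
static int pick_victim(const struct set *s)
{
    for (int i = 0; i < WAYS; i++)          /* prefer an empty, healthy way */
        if (!s->way[i].faulty && !s->way[i].valid)
            return i;
    for (int i = 0; i < WAYS; i++)          /* otherwise evict any healthy way */
        if (!s->way[i].faulty)
            return i;
    return -1;
}

int main(void)
{
    struct set s = {0};
    s.way[0].faulty = true;                 /* way 0 flagged as bad */
    printf("victim way: %d\n", pick_victim(&s));   /* picks way 1 */
    return 0;
}
```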