OK, for the purposes of this thread, I don’t care if what I’m asking is impractical or that RAM is cheap. I’m just curious whether this is possible.
DOS and Windows (and I’m assuming Linux and Mac OS) can scan a hard drive and mark bad sectors so that they won’t be used.
Could the same thing be done for RAM? Obviously the RAM itself couldn’t be marked; the bad RAM location(s) would have to be put in a file. But at any rate, if you had a bad RAM stick with one or more specific addresses that are bad, could an OS work around them?
Sure.
It’s been done in the past, and may still be done in fault-tolerant systems.
But RAM is cheap these days, and it’s usually considered a better idea to throw away an entire stick if part of it is bad.
Typically, the defect mapping is done by the memory system, and not the OS. An example.
Thanks for the answers. I know about error detection and correction; I was just thinking of motherboards that don’t take parity or ECC RAM, or anything like that.
Since the mapping from virtual addresses to physical addresses is under the control of the OS kernel, the answer is basically yes. The OS can keep a table of bad physical pages and simply never allocate those pages. There is an issue for low physical memory addresses, because the kernel boots in physical address mode, and bad memory in the area needed for early bootstrap can’t be avoided in this manner. But the rest clearly could. In principle.
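In practice that bad-page table is just a list the physical page allocator consults before handing out a frame. Here’s a minimal C sketch of the idea, with a toy bump allocator and made-up names (bad_pfns, alloc_pfn), not anything from a real kernel:

```c
/* Illustrative sketch only: a toy physical-page allocator that skips
 * frames listed as bad. All names here are invented for the example. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BAD 64

static uint64_t bad_pfns[MAX_BAD];   /* page frame numbers known to be bad */
static size_t   n_bad;

static bool pfn_is_bad(uint64_t pfn)
{
    for (size_t i = 0; i < n_bad; i++)
        if (bad_pfns[i] == pfn)
            return true;
    return false;
}

static uint64_t next_free_pfn;       /* simple bump allocator for the sketch */

/* Hand out the next free physical frame, never returning a bad one. */
static uint64_t alloc_pfn(void)
{
    while (pfn_is_bad(next_free_pfn))
        next_free_pfn++;             /* bad frames are simply never mapped */
    return next_free_pfn++;
}

int main(void)
{
    bad_pfns[n_bad++] = 2;           /* pretend frame 2 failed a memory test */
    for (int i = 0; i < 4; i++)
        printf("allocated frame %llu\n", (unsigned long long)alloc_pfn());
    return 0;                        /* prints 0, 1, 3, 4 */
}
```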
However, detecting bad memory isn’t trivial. ECC can flag correctable and uncorrectable errors; parity memory can also flag single-bit errors, but not correct them, and it can’t see all double-bit errors. If the OS gets an uncorrectable error there is little it can do except assume that the process that incurred the error is now toast. If it was the kernel, well, a panic (or BSOD) is about the only reasonable (if you can call it that) behaviour. So a system that is in some sense tolerant of memory errors (by trying to map out bad memory) is probably going to see a lot more hard failures too. Mapping out pages with persistent single-bit errors in an ECC system might be a reasonable compromise.
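The resulting policy looks roughly like the following. This is an illustrative sketch with invented types and stub functions, not any real kernel’s machine-check handling:

```c
/* Illustrative only: how an OS might react to a reported memory error.
 * All types, names, and stubs are invented for the example. */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

enum severity { CORRECTABLE, UNCORRECTABLE };

struct mem_error {
    enum severity sev;
    uint64_t      pfn;        /* physical frame the error was reported in */
    bool          in_kernel;  /* kernel context, or a user process? */
};

/* Stubs standing in for real kernel machinery. */
static void retire_page(uint64_t pfn)   { printf("retiring frame %llu\n", (unsigned long long)pfn); }
static void log_corrected(uint64_t pfn) { printf("corrected error in frame %llu\n", (unsigned long long)pfn); }

static void handle_mem_error(const struct mem_error *e)
{
    if (e->sev == CORRECTABLE) {
        /* ECC already fixed the data; just record it. Retiring the frame
         * only if it keeps faulting is the "reasonable compromise" above. */
        log_corrected(e->pfn);
        return;
    }
    /* Uncorrectable: whatever was using that memory is toast. */
    retire_page(e->pfn);
    if (e->in_kernel)
        abort();              /* stand-in for a panic/BSOD */
    else
        printf("killing the process that hit frame %llu\n",
               (unsigned long long)e->pfn);
}

int main(void)
{
    struct mem_error e = { UNCORRECTABLE, 1234, false };
    handle_mem_error(&e);
    return 0;
}
```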
Properly testing memory is appallingly difficult. Almost no-one ever does a real memory function test, as they take days. Memory failures can be very subtle: leaky cells that forget over a period of hours, and worse. So it is going to be difficult to distinguish truly failing memory addresses from random one-off failures.
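To see why a quick pass proves so little, here is a toy pattern test in C over an ordinary malloc’d buffer (a stand-in for raw physical memory, ignoring caches and the compiler optimising the loops). It will catch stuck-at bits, but a leaky cell that only forgets over hours sails straight through unless you insert that hours-long wait where the comment says:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

/* Write a pattern, read it back, then repeat with the complement. This
 * catches simple stuck-at faults; retention faults need a long delay
 * between the write pass and the read pass, which is part of why real
 * memory tests take so long. */
static size_t test_region(uint64_t *buf, size_t words, uint64_t pattern)
{
    size_t bad = 0;
    for (size_t i = 0; i < words; i++) buf[i] = pattern;
    /* A retention test would wait minutes-to-hours right here. */
    for (size_t i = 0; i < words; i++)
        if (buf[i] != pattern) bad++;
    for (size_t i = 0; i < words; i++) buf[i] = ~pattern;
    for (size_t i = 0; i < words; i++)
        if (buf[i] != ~pattern) bad++;
    return bad;
}

int main(void)
{
    size_t words = 1 << 20;                       /* 8 MiB test region */
    uint64_t *buf = malloc(words * sizeof *buf);
    if (!buf) return 1;
    printf("%zu bad reads\n", test_region(buf, words, 0xAAAAAAAAAAAAAAAAull));
    free(buf);
    return 0;
}
```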
Mapping out the bad memory in the OS has the advantage that you don’t add any overhead to the memory internals - since you are already performing an address translation for virtual memory, you have already paid the price.
Bad caches, I agree. But for bad embedded memory, doing a software mapping would be absurdly expensive.
Testing memory functionally is tough, but pretty much all processors use Built-In Self Test (BIST) for embedded memories - and there are also tools that put the BIST hardware on an ASIC and use the same technique to test external memories. The BIST hardware sits right next to the RAM and tests it at speed, which is very important. In the old days the BIST would be set up with an algorithm to test for the defects most likely to be seen in a given memory design, but we discovered that nature always inserted defects that we didn’t bother to test for, so many BIST engines are now programmable.
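“Programmable” here means the test algorithm lives in a little table rather than in hardwired logic. March C- is a standard published march test; the sketch below just runs it in software over an array to show the shape of the thing. The structs are invented for the example, and real BIST does this in hardware, at speed, right next to the RAM:

```c
/* Sketch of a march test expressed as data rather than fixed logic. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

enum op { W0, W1, R0, R1 };              /* write 0/1, read expecting 0/1 */

struct march_element {
    int           dir;                   /* +1 = ascending, -1 = descending */
    const enum op *ops;
    size_t        n_ops;
};

/* March C-: {^(w0); ^(r0,w1); ^(r1,w0); v(r0,w1); v(r1,w0); ^(r0)} */
static const enum op e0[] = {W0},     e1[] = {R0, W1}, e2[] = {R1, W0},
                     e3[] = {R0, W1}, e4[] = {R1, W0}, e5[] = {R0};
static const struct march_element march_c_minus[] = {
    {+1, e0, 1}, {+1, e1, 2}, {+1, e2, 2}, {-1, e3, 2}, {-1, e4, 2}, {+1, e5, 1},
};

static size_t run_march(uint8_t *mem, size_t n)
{
    size_t fails = 0;
    for (size_t e = 0; e < sizeof march_c_minus / sizeof *march_c_minus; e++) {
        const struct march_element *m = &march_c_minus[e];
        for (size_t k = 0; k < n; k++) {
            size_t a = (m->dir > 0) ? k : n - 1 - k;   /* address order matters */
            for (size_t j = 0; j < m->n_ops; j++) {
                switch (m->ops[j]) {
                case W0: mem[a] = 0; break;
                case W1: mem[a] = 1; break;
                case R0: if (mem[a] != 0) fails++; break;
                case R1: if (mem[a] != 1) fails++; break;
                }
            }
        }
    }
    return fails;
}

int main(void)
{
    uint8_t mem[256] = {0};                /* pretend memory under test */
    printf("fails: %zu\n", run_march(mem, sizeof mem));
    return 0;
}
```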
Memories are so big these days that their failure rates during production are very high. To improve yields nearly everyone designs them with redundant rows and columns. During manufacturing test, you run BIST on the memory, find the failing bits, and blow efuses to do the substitution, using several clever algorithms.
If you really want fault tolerance, this can be extended to Built-In Self Repair, where on power-up you run BIST, find the defective cells, and use soft fuses or something similar to do the repair. This doesn’t hurt performance, and the hardware overhead doesn’t matter, since you gain far more in yield from the repairs than you lose to the repair logic.
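Either way, the repair itself amounts to a row (or column) substitution in the decoder. A rough C model of the row-decode side, with invented names, works the same whether the remap registers are loaded from blown efuses at the factory or from a power-up BIST run:

```c
/* Sketch of row redundancy: the row decoder compares the incoming row
 * address against a small set of remap registers and steers matching
 * accesses to a spare row instead. Names are invented for the example. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define N_SPARES 2

struct remap {
    bool     valid;       /* fuse blown / repair programmed */
    uint32_t bad_row;     /* row found failing by BIST */
};

static struct remap spares[N_SPARES];

/* Returns the physical row actually driven for a requested row address. */
static uint32_t decode_row(uint32_t row, uint32_t first_spare_row)
{
    for (int i = 0; i < N_SPARES; i++)
        if (spares[i].valid && spares[i].bad_row == row)
            return first_spare_row + i;   /* steer to the spare row */
    return row;                           /* normal case: no repair hit */
}

int main(void)
{
    spares[0] = (struct remap){ true, 37 };          /* BIST found row 37 bad */
    printf("row 36 -> %u\n", decode_row(36, 1024));  /* unchanged */
    printf("row 37 -> %u\n", decode_row(37, 1024));  /* remapped to 1024 */
    return 0;
}
```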
The OpenSPARC definition has ways of marking bits as dirty, as a way of implementing mapping for faulty caches. I was on a committee looking at this a few years ago, but I think I tossed my documentation.
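The general idea is easy to picture even without the SPARC documentation: each cache way carries a flag saying “don’t use me”, and the victim-selection logic simply never allocates into a way so marked. A toy sketch with an invented structure, not the actual OpenSPARC scheme:

```c
/* Toy sketch of mapping out faulty cache ways: each way in a set carries
 * a "faulty" flag, and victim selection skips flagged ways entirely. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct way {
    bool     valid;
    bool     faulty;    /* set by BIST, or after repeated errors */
    uint64_t tag;
};

struct set { struct way way[WAYS]; };

/* Pick a way to allocate into, skipping faulty ways. Returns -1 if the
 * whole set has been mapped out (the access then just misses to memory). */
static int pick_victim(const struct set *s)
{
    for (int i = 0; i < WAYS; i++)          /* prefer an empty, healthy way */
        if (!s->way[i].faulty && !s->way[i].valid)
            return i;
    for (int i = 0; i < WAYS; i++)          /* otherwise evict any healthy way */
        if (!s->way[i].faulty)
            return i;
    return -1;
}

int main(void)
{
    struct set s = {0};
    s.way[0].faulty = true;                 /* way 0 flagged as bad */
    printf("victim way: %d\n", pick_victim(&s));   /* picks way 1 */
    return 0;
}
```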