I have a little project that requires error-correcting codes, but so far my research hasn’t really run across a good introduction to the subject. My specific need is for a very computationally efficient code–ideally something that works in a lookup table.
I ran across the BCH(7,4) code, which sounds fairly nice–encode 4 bits into 7, with 1-bit correction. Trouble is, 8 bits is better for my purposes, and I’d like to actually get something with that extra bit instead of just wasting it. Also, I couldn’t find any nice simplified descriptions of the code that would be useful in actually implementing it.
So I decided to just search for a code. I coded up a really naïve program that looks like it should take around 256!/(256-16)! operations, but in practice has enough early-outs that it finishes instantly. And indeed, it found a (8,4) code with 1-bit correction and 2-bit detection.
Here are the 16 codewords:
00000000
00001111
00110011
00111100
01010101
01011010
01100110
01101001
10010110
10011001
10100101
10101010
11000011
11001100
11110000
11111111
A very pretty pattern, I thought! Some nice symmetries in there. Each word has 0, 4, or 8 bits set. It’s obvious also that there is a general class of 1-bit correcting/2-bit detecting codes where the codewords have bitcounts of 4N. A single-bit error means you have 4N+1 or 4N-1 bits set, which can’t be confused with 4(N+1)-1 or 4(N-1)+1. There is of course overlap between 4N+2 and 4(N+1)-2, so for that you would only get detection.
Obviously, a code this simple has been discovered already–so which one is it? I perused the list on Wikipedia but didn’t really see a match.
I haven’t really checked, but isn’t it just the (8,4) Hamming code (the (7,4) code extended with an extra parity bit)? Here’s a list (right side of the table) of the codewords; from a quick glance, it looks like yours can be obtained from them by a bit shift.
Yes, it’s Hamming’s (7,4) augmented as Wiki desribes: “If, in addition, an overall parity bit (bit 0) is included, the code can detect (but not correct) any two-bit error, making a SECDED code.” Any such code can be rearranged without affecting its properties, as OP has done here to make certain symmetries more apparent.
(BTW, if you remove any column from your array, and all the rows with a 1 in that column, you’ll get an elegant 8-sized group frequently seen in combinatorial solutions.)
Once upon a time my work as circuit designer and diagnostician involved me with the nitty-gritty of IBM 370 ECC, similar to your scheme but (72,8) instead of (8,4). There were various details that seemed amusing when sleep-deprived. For example, the same circuitry that checked ECC also developed ordinary parity bits to pass on the bus to CPU. Those parity bits were generated before the data was corrected, so had to themselves be corrected. There were certain triple-bit error combinations which would cause Storage Data Bus Out parity checks!
Good catch, folks. Somehow I had missed that section of the Hamming wiki. Also, I suppose I should have guessed that it was a Hamming code, given that I observed it being a perfect code. It’s sorta trivially obvious that a perfect code for 1-bit ECC must have 2^n-1 length. 7 bits fits.
'Course, now I’m curious about the next one I found: a [12,7] SECDED code. It’s no longer a perfect code; it has 87.5% efficiency. Clearly not a Hamming. Spoilered, since it’s fairly big:
[/spoiler]The same patterns exist here as in my original code. As expected, each codeword has 0, 4, 8, or 12 bits set. I also find it curious that a totally naïve search is still instant. It never actually has to backtrack: the algorithm is really just “pick the first unused codeword that doesn’t violate the neighbor constraints, and repeat”.
I was hoping that a [12,8] SEC code existed, since 2^12/(1+12) > 2^8. But that hasn’t been forthcoming.
(I see I got the terminology backward – what I called IBM’s (72,8) code was actually a (72,64) code.)
The key bound, IIRC, is that K check bits can provide SECDED for at most 2[SUP]K-1[/SUP] total bits, i.e. (2[SUP]K-1[/SUP]-K) data bits. So your 2nd code, with K=5, could check 11 data bits; and thus has an easy job checking only 7 data bits.
For SEC, instead of SECDED, a check bit can be saved so your (12,8) is no problem.
Ok, interesting. I’m not a huge fan of thinking of codes in terms of check bits–as I see it, there’s no reason to make any distinction between data and check bits–but a bound is a bound. As it turns out, the only reason I couldn’t find a [12,8] SEC code was a typo on my part. Only 81% efficiency, but that’s still not bad.
Indeed. A few times, I did try to see if I could pack one more codeword in for configurations with a low efficiency, but so far I haven’t found a case where it works. Doesn’t mean it can’t, though.
Fair enough! FWIW, my application is for flash memory on a nanosatellite. Bitflips happen…
For any real, practical error correcting code, you want to first figure out what sorts of errors you’re most likely to have. Then there are considerations like how common you expect errors to be, what the consequences are for lost data, and whether you have two-way communications or only one-way.
For instance, if errors are rare enough and you have two-way communication, you could just stick a single parity bit on a whole bunch of data bits, and if it doesn’t match up, just ask the sender to re-send that whole bunch. Alternately, though, if you’ve got transposition errors, a single parity bit will never notice the difference. Or maybe your errors are rare but tend to clump, such that you can get two, or three, or a dozen bits in a row all flipped, or all forced to 0, or all forced to 1. Maybe you’ve got problems with analog to digital conversion, such that some bits are in between and you might assign them to the wrong end, but you’re pretty sure which bits are the suspect ones.
Rather than re-inventing the wheel, you probably want to dig into the literature and find someone else who’s been working to similar constraints to yours, and just use their solution off-the-shelf.
We could probably stand to do more digging, but as far as we know, single-bit correction is good for our purposes. This is for flash memory, which as it degrades tends to produce isolated stuck bits. There’s no real clustering, so there’s no need to handle long strings of errors, like you’d get with radio communication or optical storage.
We’ll also have block-level CRC, though that’s more to detect if a flash page is damaged enough that we shouldn’t use it again. Obviously, since this is storage, there’s no ability to rerequest the data.
Also, we have resource constraints, both in physical hardware and manpower. A 90% solution that ships is better than a 99.9% solution that doesn’t. It’s not the end of the world if some errors still fall through the cracks, but it would be nice to have some basic protection.