Anyhoo, I’ve come across this really cool software trick called parity files. Parity files have an almost magical quality that works like this: if one file in a set of many is missing, you can use a ‘generic’ par file (created at the same time as the set) to replace whichever file went missing. Using a special program, the par file will regenerate a copy of the missing file.
Apparently par files are based on an important CS algorithm known as “Reed-Solomon”, which self-correcting RAID arrays also use. Anyway, I can’t for the life of me wrap my head around how a ‘generic’ file can replace a specific file (i.e. zip file #5 of 10). How can 20 megs of data be used to somehow represent 100 megs?!? Granted, you can only replace one or two zips with that amount of pars, but I still don’t get how it’s interchangeable.
To clarify a bit more: let’s say I want to send my publisher a copy of the Bioinformatics textbook I’m writing. He uses Gmail, which only allows 10 meg attachments. My Word document is 50 megs (with graphics). I use pkzip to split the .doc file into five 10 meg zips (novel1.zip, novel2.zip, etc.). If I use .par, I can make an extra 10 meg file which I can send separately. Now if any of the 5 zip files didn’t arrive in his inbox, he can use the 1 magic file to replace whichever one was missing.
I’d suggest saving the hassle and mailing him two CD-ROMs, each with ten copies of the book on them, protected by elaborate encryption (you can email him the necessary keys). The cost is trivial, and making multiple copies across more than one disk effectively guarantees your data will get through, barring a postal strike or whatnot.
For each bit position in the set of files, we can count how many ones and zeroes there are. For example, suppose the first bit of each of three files is 0, 1, and 0 respectively. If we’re using even parity, we set the parity bit to 0 when there is an even number of ones, and to 1 when there is an odd number. Since the first bits here contain a single one, the first bit of the parity file is a 1, and the rest of the parity file is built the same way, one bit position at a time.
Now let’s say we lose File 2, but we have Files 1 and 3, and the parity file. We notice that the first bit of File 1 is a 0, and the first bit of File 3 is also a zero. The first bit of the parity file is a one, meaning there must be an odd number of ones in the set. Therefore, the first bit of File 2 must be a one. File 2 can be fully reconstructed in this manner.
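Here is a tiny Python sketch of that idea. It is not the actual PAR file format, just the even-parity/XOR principle described above, with made-up byte strings standing in for the files:

```python
from functools import reduce

def xor_blocks(blocks):
    """Even parity, bit by bit: XOR equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

file1 = b"chapter one text"
file2 = b"chapter two text"
file3 = b"chapter 3, words"

parity = xor_blocks([file1, file2, file3])   # plays the role of the .par file

# Say file2 never arrives.  XOR the parity file with the files that did arrive
# and the missing one falls back out, because x ^ x = 0.
recovered = xor_blocks([parity, file1, file3])
assert recovered == file2
```

That is also why the single parity file can stand in for *any* one of the data files: it doesn’t matter which file is missing, the XOR of the survivors always reproduces it.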
By using this concept in some more complicated ways, you can record enough parity information to deal with even more than one lost file.
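For the curious, here is a rough sketch of one such “more complicated way”: two parity blocks that can rebuild any two missing data blocks, using arithmetic in the finite field GF(256), the same kind of math Reed-Solomon codes are built on. This is only an illustration under my own assumptions (not how PAR files are actually laid out), and the block names and contents are invented:

```python
# GF(256) tables, primitive polynomial 0x11d, generator 2
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 255]

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def scale(block, c):
    return bytes(gf_mul(byte, c) for byte in block)

def make_parity(blocks):
    """P = d0 ^ d1 ^ ...   (plain XOR parity)
       Q = g^0*d0 ^ g^1*d1 ^ ...   (weighted parity in GF(256))"""
    p = bytes(len(blocks[0]))
    q = bytes(len(blocks[0]))
    for i, d in enumerate(blocks):
        p = xor_bytes(p, d)
        q = xor_bytes(q, scale(d, EXP[i]))
    return p, q

def rebuild_two(blocks, j, k, p, q):
    """Rebuild missing blocks j and k (given as None in `blocks`)."""
    a, b = p, q
    for i, d in enumerate(blocks):
        if d is None:
            continue
        a = xor_bytes(a, d)                    # A = d_j ^ d_k
        b = xor_bytes(b, scale(d, EXP[i]))     # B = g^j*d_j ^ g^k*d_k
    denom = EXP[j] ^ EXP[k]                    # nonzero whenever j != k
    dj = bytes(gf_div(bb ^ gf_mul(aa, EXP[k]), denom) for aa, bb in zip(a, b))
    dk = xor_bytes(a, dj)
    return dj, dk

# quick self-check with four equal-length "files"
data = [b"part-one", b"part-two", b"partIII!", b"part-4!!"]
p, q = make_parity(data)
damaged = [data[0], None, data[2], None]       # lose blocks 1 and 3
d1, d3 = rebuild_two(damaged, 1, 3, p, q)
assert (d1, d3) == (data[1], data[3])
```

The trick is that the second parity block weights each file differently, so two missing files give you two independent equations that can be solved, rather than one equation with two unknowns.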
There isn’t a space saving at all; engineering redundancy into data requires more storage space than the original data.
Another example of this is PDF417 (Portable Data File, not the Adobe sort), one of the more popular 2D barcodes. The amount of redundancy is variable; on the maximum setting, the scanner can still read the content even if up to half of the barcode label is damaged, obscured, or torn off.
But the downside is that the more redundancy you build in, the more space is required to express the same amount of data, and that is always more than the raw data alone would occupy.
If you get really clever, you can account not only for lost files but for erroneous ones. A simple parity method will tell you that you have a problem in that case (if you get all of the files but they don’t add up right), but it won’t let you fix it. Suppose, though, that you had four files and thought there was a chance one of them might get swapped out for a completely different file. You could add three more files in such a way that you could not only determine that a file is wrong, and which one it is, but also correct it.
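One classic way to do this is a Hamming(7,4)-style code applied across the files, position by position. The sketch below is just a toy under my own assumptions (lists of bits standing in for files, invented names, not any real parity-file format), but it shows three parity “files” protecting four data “files” and repairing one that got replaced wholesale, as long as at most one file is wrong:

```python
# Hamming convention: positions 1, 2, 4 hold parity; 3, 5, 6, 7 hold data.
def hamming_parity(d3, d5, d6, d7):
    """Three parity 'files' covering four data 'files' (lists of bits)."""
    p1 = [a ^ b ^ c for a, b, c in zip(d3, d5, d7)]   # covers positions 1,3,5,7
    p2 = [a ^ b ^ c for a, b, c in zip(d3, d6, d7)]   # covers positions 2,3,6,7
    p4 = [a ^ b ^ c for a, b, c in zip(d5, d6, d7)]   # covers positions 4,5,6,7
    return p1, p2, p4

def locate_and_fix(files):
    """files is a dict {position: bit list} for positions 1..7.
    Returns a corrected copy, assuming at most one file is wrong."""
    n = len(files[1])
    fixed = {k: list(v) for k, v in files.items()}
    for i in range(n):
        s1 = files[1][i] ^ files[3][i] ^ files[5][i] ^ files[7][i]
        s2 = files[2][i] ^ files[3][i] ^ files[6][i] ^ files[7][i]
        s4 = files[4][i] ^ files[5][i] ^ files[6][i] ^ files[7][i]
        bad = s1 + 2 * s2 + 4 * s4   # 0 = consistent; otherwise the bad position
        if bad:
            fixed[bad][i] ^= 1       # flip the offending bit
    return fixed

# four data "files" of 8 bits each
d3 = [1, 0, 1, 1, 0, 0, 1, 0]
d5 = [0, 1, 1, 0, 1, 0, 0, 1]
d6 = [1, 1, 0, 0, 1, 1, 0, 0]
d7 = [0, 0, 0, 1, 1, 0, 1, 1]
p1, p2, p4 = hamming_parity(d3, d5, d6, d7)
stored = {1: p1, 2: p2, 3: d3, 4: p4, 5: d5, 6: d6, 7: d7}

# pretend the file at position 6 got swapped out for garbage
garbled = dict(stored)
garbled[6] = [0, 0, 1, 1, 0, 0, 1, 1]
repaired = locate_and_fix(garbled)
assert repaired[6] == d6 and all(repaired[k] == stored[k] for k in stored)
```

The syndrome value not only says “something is wrong” but names the position of the wrong file, which is what lets you repair it rather than just detect it.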
The basic principle comes from information theory, and was developed to reconstruct information sent over noisy channels. friedo explained it perfectly. This is not only not a new thing; it is used by pretty much all microprocessors for large internal memories. Feature sizes are so small these days that cosmic rays will clobber a bit of the memory from time to time. Error correcting codes (ECCs) are used to correct the erroneous data when it is read, which means there are always a few extra bits in each word of memory. The more bits you have, the more errors you can tolerate. There is also a point where you know there is an error but cannot correct it.
I’ve never heard of this for files, and wonder why it is needed, since data transmitted on the net usually has some sort of parity in the packets to tell when to retry. It’s a clever idea though, and I assume the last file gets padded somehow so that the file sizes are all equal.
I’ve never heard of this for files, and wonder why it is needed, since data transmitted on the net usually has some sort of parity in the packets to tell when to retry.
It is common practice when posting larger binaries to Usenet servers, since propagation among servers works differently from a simple FTP transfer.
It really isn’t necessary to know what type of file is involved to understand how parity sets work, but it was, I believe, the mention of the type of file being downloaded that made samclem uneasy about the first post. I use parity sets when posting my own original video productions made with Poser and Bryce, so I can assure you that parity sets are not only for questionable material.
I’m hardly claiming the mod seriously infringed on my freedom of speech; I just think they showed poor judgement. The page that was linked was a FAQ on parity files, and the host had zero illegal content on their server, just a bunch of other FAQs and info.
No big deal; I got all the info I was looking for, so the system works.
I closed it because I was unsure if it was heading in an illegal direction. I received a “reported post” from another Doper who is, IMHO, rather savvy in the computer world. I tend to take his word for things, as my knowledge is about medium.
I thought that any info provided might lead to illegal things. NOT that YOU were looking to do anything illegal, but that it might head that way.
When you opened your OP with
Do you see how I could misconstrue that as, ahem, an attempt to circumvent things?
I’m glad you got your info, and that the world seems to still be spinning.