Anyhoo, I’ve come across this really cool software trick called parity files. Parity files have an almost magical quality that works like this: if one file in a set of many is missing, you can use a ‘generic’ par file (created at the same time as the set) to replace whichever file went missing. Using a special program, the par file will regenerate a copy of the missing file.
Apparently par files are based on an important CS algorithm known as “Reed-Solomon”, which self-correcting RAID arrays also use. Anyway, I can’t for the life of me wrap my head around how a ‘generic’ file can replace a specific file (i.e. zip file #5 of 10). How can 20 megs of data be used to somehow represent 100 megs?!? Granted, you can only replace one or two zips with that amount of pars, but I still don’t get how it’s interchangeable.
To clarify a bit more: let’s say I want to send my publisher a copy of the Bioinformatics textbook I’m writing. He uses Gmail, which only allows 10 meg attachments. My Word document is 50 megs (with graphics). I use pkzip to split the .doc file into five 10 meg zips (novel1.zip, novel2.zip, etc.). If I use .par, I can make an extra 10 meg file which I can send separately. Now if any of the 5 zip files didn’t arrive in his inbox, he can use the 1 magic file to replace whichever one was missing.
I’d suggest saving the hassle and mailing him two CD-ROMs, each with ten copies of the book on them, protected by elaborate encryption (you can email him the necessary keys). The cost is trivial, and making multiple copies across more than one disk effectively guarantees your data will get through, barring a postal strike or whatnot.
For each bit position in the set of files, we can count how many ones and zeroes there are. For example, suppose the first bit of each of three files is 0, 1, and 0 respectively. If we’re using even parity, we set the parity bit to 0 when there is an even number of ones, and to 1 when there is an odd number. Since the first bits here contain a single one, the first bit of the parity file is a 1, and the rest of the parity file is built the same way, one bit position at a time.
Now let’s say we lose File 2, but we have Files 1 and 3, and the parity file. We notice that the first bit of File 1 is a 0, and the first bit of File 3 is also a zero. The first bit of the parity file is a one, meaning there must be an odd number of ones in the set. Therefore, the first bit of File 2 must be a one. File 2 can be fully reconstructed in this manner.
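Here is a tiny Python sketch of that idea. It is not the actual PAR file format, just the even-parity/XOR principle described above, with made-up byte strings standing in for the files:

```python
from functools import reduce

def xor_blocks(blocks):
    """Even parity, bit by bit: XOR equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

file1 = b"chapter one text"
file2 = b"chapter two text"
file3 = b"chapter 3, words"

parity = xor_blocks([file1, file2, file3])   # plays the role of the .par file

# Say file2 never arrives.  XOR the parity file with the files that did arrive
# and the missing one falls back out, because x ^ x = 0.
recovered = xor_blocks([parity, file1, file3])
assert recovered == file2
```

That is also why the single parity file can stand in for *any* one of the data files: it doesn’t matter which file is missing, the XOR of the survivors always reproduces it.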
By using this concept in some more complicated ways, you can record enough parity information to deal with even more than one lost file.
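For the curious, here is a rough sketch of one such “more complicated way”: two parity blocks that can rebuild any two missing data blocks, using arithmetic in the finite field GF(256), the same kind of math Reed-Solomon codes are built on. This is only an illustration under my own assumptions (not how PAR files are actually laid out), and the block names and contents are invented:

```python
# GF(256) tables, primitive polynomial 0x11d, generator 2
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 255]

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def scale(block, c):
    return bytes(gf_mul(byte, c) for byte in block)

def make_parity(blocks):
    """P = d0 ^ d1 ^ ...   (plain XOR parity)
       Q = g^0*d0 ^ g^1*d1 ^ ...   (weighted parity in GF(256))"""
    p = bytes(len(blocks[0]))
    q = bytes(len(blocks[0]))
    for i, d in enumerate(blocks):
        p = xor_bytes(p, d)
        q = xor_bytes(q, scale(d, EXP[i]))
    return p, q

def rebuild_two(blocks, j, k, p, q):
    """Rebuild missing blocks j and k (given as None in `blocks`)."""
    a, b = p, q
    for i, d in enumerate(blocks):
        if d is None:
            continue
        a = xor_bytes(a, d)                    # A = d_j ^ d_k
        b = xor_bytes(b, scale(d, EXP[i]))     # B = g^j*d_j ^ g^k*d_k
    denom = EXP[j] ^ EXP[k]                    # nonzero whenever j != k
    dj = bytes(gf_div(bb ^ gf_mul(aa, EXP[k]), denom) for aa, bb in zip(a, b))
    dk = xor_bytes(a, dj)
    return dj, dk

# quick self-check with four equal-length "files"
data = [b"part-one", b"part-two", b"partIII!", b"part-4!!"]
p, q = make_parity(data)
damaged = [data[0], None, data[2], None]       # lose blocks 1 and 3
d1, d3 = rebuild_two(damaged, 1, 3, p, q)
assert (d1, d3) == (data[1], data[3])
```

The trick is that the second parity block weights each file differently, so two missing files give you two independent equations that can be solved, rather than one equation with two unknowns.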
There isn’t a space saving at all; engineering redundancy into data requires more storage space than the original data.
Another example of this is PDF417 (Portable Data File, not the Adobe sort), one of the more popular 2D barcodes. The amount of redundancy is variable; on the maximum setting, the scanner can still read the content even if up to half of the barcode label is damaged, obscured, or torn off.
But the downside is that the more redundancy you build in, the more space is required to express the same amount of data, and that is always more than the raw data alone would occupy.
If you get really clever, you can account not only for lost files but for erroneous ones. A simple parity method will tell you that you have a problem in that case (if you get all of the files but they don’t add up right), but it won’t let you fix it. Suppose, though, that you had four files and thought there was a chance one of them might get swapped out for a completely different file. You could add three more files in such a way that you could not only determine that a file is wrong, and which one it is, but also correct it.
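One classic way to do this is a Hamming(7,4)-style code applied across the files, position by position. The sketch below is just a toy under my own assumptions (lists of bits standing in for files, invented names, not any real parity-file format), but it shows three parity “files” protecting four data “files” and repairing one that got replaced wholesale, as long as at most one file is wrong:

```python
# Hamming convention: positions 1, 2, 4 hold parity; 3, 5, 6, 7 hold data.
def hamming_parity(d3, d5, d6, d7):
    """Three parity 'files' covering four data 'files' (lists of bits)."""
    p1 = [a ^ b ^ c for a, b, c in zip(d3, d5, d7)]   # covers positions 1,3,5,7
    p2 = [a ^ b ^ c for a, b, c in zip(d3, d6, d7)]   # covers positions 2,3,6,7
    p4 = [a ^ b ^ c for a, b, c in zip(d5, d6, d7)]   # covers positions 4,5,6,7
    return p1, p2, p4

def locate_and_fix(files):
    """files is a dict {position: bit list} for positions 1..7.
    Returns a corrected copy, assuming at most one file is wrong."""
    n = len(files[1])
    fixed = {k: list(v) for k, v in files.items()}
    for i in range(n):
        s1 = files[1][i] ^ files[3][i] ^ files[5][i] ^ files[7][i]
        s2 = files[2][i] ^ files[3][i] ^ files[6][i] ^ files[7][i]
        s4 = files[4][i] ^ files[5][i] ^ files[6][i] ^ files[7][i]
        bad = s1 + 2 * s2 + 4 * s4   # 0 = consistent; otherwise the bad position
        if bad:
            fixed[bad][i] ^= 1       # flip the offending bit
    return fixed

# four data "files" of 8 bits each
d3 = [1, 0, 1, 1, 0, 0, 1, 0]
d5 = [0, 1, 1, 0, 1, 0, 0, 1]
d6 = [1, 1, 0, 0, 1, 1, 0, 0]
d7 = [0, 0, 0, 1, 1, 0, 1, 1]
p1, p2, p4 = hamming_parity(d3, d5, d6, d7)
stored = {1: p1, 2: p2, 3: d3, 4: p4, 5: d5, 6: d6, 7: d7}

# pretend the file at position 6 got swapped out for garbage
garbled = dict(stored)
garbled[6] = [0, 0, 1, 1, 0, 0, 1, 1]
repaired = locate_and_fix(garbled)
assert repaired[6] == d6 and all(repaired[k] == stored[k] for k in stored)
```

The syndrome value not only says “something is wrong” but names the position of the wrong file, which is what lets you repair it rather than just detect it.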
The basic principle comes from information theory, and was developed to reconstruct information sent over noisy channels. friedo explained it perfectly. This is not only not a new thing; it is used by pretty much all microprocessors for large internal memories. Feature sizes are so small these days that cosmic rays will clobber a bit of the memory from time to time. Error correcting codes (ECCs) are used to correct the erroneous data when it is read, which means there are always a few extra bits in each word of memory. The more bits you have, the more errors you can tolerate. There is also a point where you know there is an error but cannot correct it.
I’ve never heard of this for files, and wonder why it is needed, since data transmitted on the net usually has some sort of parity in the packets to tell when to retry. It’s a clever idea though, and I assume the last file gets padded somehow so that the file sizes are all equal.
I’ve never heard of this for files, and wonder why it is needed, since data transmitted on the net usually has some sort of parity in the packets to tell when to retry.
It is common practice when posting larger binaries to Usenet servers, since propagation among servers works differently from a simple FTP transfer.
It really isn’t necessary to know what type of file is involved to understand how parity sets work, but it was, I believe, the mention of the type of file being downloaded that made samclem uneasy about the first post. I use parity sets when posting my own original video productions made with Poser and Bryce, so I can assure you that parity sets are not only for questionable material.
I’m hardly claiming the mod seriously infringed on my freedom of speech; I just think they showed poor judgement. The page that was linked was a FAQ on parity files, and the host had zero illegal content on their server, just a bunch of other FAQs and info.
No big deal; I got all the info I was looking for, so the system works.
I closed it because I was unsure if it was heading in an illegal direction. I received a “reported post” from another Doper who is, IMHO, rather savvy in the computer world. I tend to take his word for things, as my knowledge is about medium.
I thought that any info provided might lead to illegal things. NOT that YOU were looking to do anything illegal, but that it might head that way.
When you opened your OP with
Do you see how I could misconstrue that as, ahem, an attempt to circumvent things?
I’m glad you got your info, and that the world seems to still be spinning.