How does file compression work?

How does a program like PK Zip compress a file to a smaller size? I don’t really see how this can be done other than cutting part of the code out of the original program. Just wondering. Thanks.

google is your friend

All compression algorithms rely on the fact that meaningful information is not random. Because the data is redundant or repetitive, there are ways to compress it, either losing some of the information along the way or retaining all of it.

If you need to recover the exact data stream, there are a number of lossless techniques. A simple one is called “Run Length Encoding” (RLE). Consider this string of digits:

1100001010100011100000000010111100000000000001100

Now, I could represent that string like this:
1-2,4,1,1,1,1,1,3,3,9,1,1,4,13,2,2

The ‘1’ at the beginning means the first digit is 1. Every number after that tells me how many times to print that digit before switching to the other. Each comma marks a switch from 1 to 0 or from 0 to 1.

Now consider this number: 10000000000000000000000000001

Now we can compress it to: 1-1,27,1

This type of technique works well when there are large blocks of unchanging data. Take a GIF image, for example. If a GIF were 255 pixels wide, and the top 200 rows were all solid black, they could be represented like this: black(51000), or its digital equivalent. That’s a hell of a lot better than storing 51,000 individual numbers, all of which are the same.
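Just to make that concrete, here is a minimal run-length encoder/decoder in Python (the dash-and-comma notation above is only for illustration; this version stores the runs as (digit, count) pairs):

def rle_encode(bits):
    """Collapse a string of '0'/'1' characters into (digit, run length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((bit, 1))               # start a new run
    return runs

def rle_decode(runs):
    """Expand (digit, run length) pairs back into the original string."""
    return "".join(bit * count for bit, count in runs)

data = "1100001010100011100000000010111100000000000001100"
runs = rle_encode(data)
print(runs)                      # [('1', 2), ('0', 4), ('1', 1), ...] -- the same counts as above
assert rle_decode(runs) == data  # lossless: we get the exact original back

Note that run-length encoding only pays off when the runs are long; if the data flips every digit or two, the (digit, count) pairs take more room than the raw digits, which is exactly the objection raised a few posts down.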

Another method is to spot patterns and ‘tokenize’ them. For instance, consider this paragraph:

George had a father named George. George’s father’s father was also named George. When George and his father and grandfather get together, you have three Georges in the same room. His father thinks that’s crazy, but George understands. I’ll bet his grandfather does too.

Here’s a simple compression of that: First, I’m going to do a pass through it looking for certain patterns that repeat. In this case, the obvious ones are the word ‘George’, and the word ‘father’. So I will build a table of tokens like this:

father - 1
george - 2

Now I can re-write the sentence like this:

2 had a 1 named 2. 2’s 1’s 1 was also named 2. When 2 and his 1 and grand1 get together, you have three 2s in the same room. His 1 thinks that’s crazy, but 2 understands. I’ll bet his grand1 does too.

That’s roughly the way algorithms like the Lempel-Ziv compression in PKZIP work. The executable of a program or an image file has TONS of repeating patterns. Replace each pattern with a smaller token, and you can compress the file.
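Here is a rough sketch of that tokenizing idea in Python. Real LZ coders discover repeating substrings on the fly and use variable-length references rather than a fixed word list, so treat this as the George example in code, not as PKZIP’s actual algorithm:

def tokenize(text, phrases):
    """Replace each repeated phrase with a short token, as in the George example."""
    table = {}
    for i, phrase in enumerate(phrases, start=1):
        token = "\x01" + str(i) + "\x02"   # wrap the number in unusual bytes so tokens can't collide with normal text
        table[token] = phrase
        text = text.replace(phrase, token)
    return text, table

def detokenize(text, table):
    for token, phrase in table.items():
        text = text.replace(token, phrase)
    return text

paragraph = ("George had a father named George. George's father's father "
             "was also named George.")
shrunk, table = tokenize(paragraph, ["George", "father"])
print(len(paragraph), "->", len(shrunk))       # the tokenized version is noticeably shorter
assert detokenize(shrunk, table) == paragraph  # and it expands back exactly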

Then there is ‘lossy’ compression, which relies on the fact that much information is redundant and unnecessary. For instance, I’ll bet you can understand this sentence:

If u strt walkng nrth, u wll reach the nrth ple in tme.

You might have to work a little bit at deciphering it, but once you do, you can extract all the information that was in it.
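A toy version of that kind of lossy text “compression”, purely to illustrate the idea (no real compressor works this way, since it relies on a human reader to reconstruct the words):

import re

def drop_vowels(text):
    """Crude lossy 'compression': drop lowercase vowels except at the start of a word."""
    return re.sub(r"\B[aeiou]", "", text)

print(drop_vowels("If you start walking north, you will reach the north pole in time."))
# prints: If y strt wlkng nrth, y wll rch th nrth pl in tm.

The vowels are gone for good, but most of the meaning survives.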

Images contain redundant information, and also information that cannot be perceived (very subtle changes in colors, for example). That can be exploited: by stripping out the imperceptible information you can compress an image, although it is no longer the same as the original.

Some algorithms combine methods. For instance, consider a movie. Not only does each frame have a lot of repeating patterns, but because the background in a movie stays relatively unchanged from frame to frame, just storing the parts that change in each frame lets you compress the movie dramatically. In the extreme case of those old Spider-Man cartoons where the background doesn’t even move, you could store each background once, and then all you’d have to store frame by frame is the one person’s arm or whatever else is moving.
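A sketch of that store-only-what-changes idea, assuming each frame is just a flat list of pixel values (real video codecs such as MPEG add motion compensation and then compress the differences further, so this is only the crudest form):

def delta_encode(prev_frame, frame):
    """Record only the pixels that differ from the previous frame, keyed by position."""
    return {i: new for i, (old, new) in enumerate(zip(prev_frame, frame)) if old != new}

def delta_decode(prev_frame, delta):
    frame = list(prev_frame)
    for i, value in delta.items():
        frame[i] = value
    return frame

background = [0] * 10000        # the unmoving cartoon background
next_frame = list(background)
next_frame[4242] = 255          # one pixel of the moving arm changes
delta = delta_encode(background, next_frame)
print(len(delta))               # 1 changed pixel stored instead of 10,000 values
assert delta_decode(background, delta) == next_frame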

http://www.faqs.org/faqs/compression-faq/part1/

The first part of the comp.compression FAQ. Good reading, if you like this kinda stuff.

FWIW, for any given algorithm, there is a string of length n that it can’t compress, simply because there aren’t enough possible outputs. Simple counting argument; more later if it’s necessary.

Fascinating way to explain it, Sam. Every time I hear ‘algorithms’ and whatnot, my eyes begin to glaze over. But the George example and the others actually make sense to me.

Thanks.

Going along with what ultrafilter said, no lossless algorithm (e.g. PKZip, RAR, GIF) can make all inputs of any given size smaller; if it makes some smaller, it must make some larger.

Basic example: you can’t compress all two-bit patterns into one bit, because there are 4 possible two-bit patterns and only 2 possible one-bit patterns. No matter how tricky your decompression algorithm is, if you have one “compressed” bit, it can only have one of two possible values, so you can only produce one of two possible “decompressed” output patterns. The same argument can be extended to any number of bits.
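You can check the arithmetic behind that counting argument directly: there are 2^n strings of n bits, but only 2^n - 1 strings that are strictly shorter (counting the empty string), so at least one n-bit input can’t be mapped to anything smaller:

# Pigeonhole check: n-bit strings vs. all strictly shorter bit strings.
for n in range(1, 9):
    inputs = 2 ** n                          # distinct n-bit strings
    shorter = sum(2 ** k for k in range(n))  # strings of length 0 .. n-1, empty string included
    print(n, inputs, shorter)                # 'shorter' always comes out to inputs - 1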

But there are lossy formats like JPEG and MP3, which can work all the time because they don’t promise to restore the exact same pattern after compressing and decompressing. Of course, a lossy algorithm is only useful when you don’t need the exact same pattern - if you compressed a program and it was different when you decompressed it later, that would be catastrophic.

MP3 basically works by removing details that your ear and brain are physically incapable of hearing… e.g., when you hear frequency X, it masks out frequencies Y and Z, so you don’t need to store those parts of the signal. As you lower the bit rate to get better compression, the encoder is more willing to throw out parts of the signal, so it throws out something that you otherwise would have heard (this is why low-bitrate MP3s sound bad).

JPEG works by breaking the image into small blocks, then applying an algorithm called the discrete cosine transform (DCT). Just like you can make any number by combining different powers of 2 in a binary number system (35 = 32 + 2 + 1), you can make any possible block of the image by combining a fixed set of “building blocks” in different amounts; each building block is generated by a specific cosine wave, so it looks like a smooth gradient.

Without getting into a full explanation, the JPEG encoder basically eliminates fine variations in the amounts it uses to combine the building blocks. Instead of “3 parts of block C + 18 parts of block F”, you might end up with “4 parts of block C + 16 parts of block F”.

Since it’s still a smooth gradient and your eye isn’t very sensitive to small changes in color, you don’t notice the loss of information when you look at a JPEG image… unless you try compressing an image with sharp changes in color (like a cartoon), because then you’ll see smooth gradients where there were none before.
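Here is a bare-bones illustration of that transform-and-quantize step with NumPy, for one 8x8 block of grayscale values. Real JPEG adds per-frequency quantization tables, zigzag ordering, and entropy coding on top of this, so this is only the core idea:

import numpy as np

N = 8

def dct_basis(n=N):
    """Orthonormal DCT-II matrix: each row is one cosine 'building block'."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    i = np.arange(n).reshape(1, -1)   # sample index
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2 / n)
    basis[0] /= np.sqrt(2)
    return basis

C = dct_basis()
block = np.tile(np.linspace(0, 255, N), (N, 1))   # a smooth horizontal gradient

coeffs = C @ block @ C.T              # amounts of each cosine building block
step = 40
quantized = np.round(coeffs / step)   # coarse quantization discards the fine variations
restored = C.T @ (quantized * step) @ C

print(np.abs(block - restored).max()) # max error under 10 gray levels out of 255 for this smooth block

Real JPEG quantizes the high-frequency building blocks more coarsely than the low-frequency ones, which is why sharp edges, which depend heavily on those high-frequency blocks, are where the artifacts show up.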

Not all. Some algorithms, such as Huffman coding, rely on the fact that some letters of an alphabet appear more frequently than others. Adaptive Huffman coding builds its binary tree dynamically as the data comes in. I am sure there are other ways.
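To make the Huffman idea concrete, here is a compact sketch of static Huffman coding using Python’s heapq. (Adaptive Huffman, as mentioned, rebuilds the tree as the data streams by; this version just counts frequencies up front.)

import heapq
from collections import Counter

def huffman_codes(text):
    """Assign short bit strings to frequent symbols and longer ones to rare symbols."""
    # Each heap entry is (total frequency, tie-breaker, {symbol: code bits so far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent groups...
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))   # ...are merged under a new tree node
        tie += 1
    return heap[0][2]

sample = "george had a father named george"
codes = huffman_codes(sample)
bits = sum(len(codes[ch]) for ch in sample)
print(codes[" "], codes["e"], codes["t"])    # common symbols get short codes, rare ones longer codes
print(bits, "bits, versus", 8 * len(sample), "bits at one byte per character")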

Eh, your compression algorithm ends up with a longer “compressed file” than the original. If I didn’t count it wrong, the original is 49 bits long, while the compressed version is 34 characters; at a byte per character, that’s several times the size.

That’s only true at a certain scale, sampling rate, etc. When you take a JPEG image and blow it up, it’s going to look bad.

Thanks, Dad. I was trying to explain a concept, not design an efficient algorithm.

If the letters appear with different frequencies, then it’s not random.

Thanks for an excellent explanation, Sam.

Note that one of those two-bit strings could also be compressed to the empty string. Still, that’s not enough target strings.

Random does not mean equally probable. And in the context of computer science, random just means that the system cannot anticipate the next item, as opposed to sequential. If the system cannot anticipate the next item, it does not matter if every item is the same; it’s still random.

This is a bit off-topic, but here goes.

What you say is true for samples of infinite size, but not true for any practical files we might encounter. There is absolutely nothing in probability that stops the letter “z” from being generated 25% more often than the letter “e,” for example.

Preface: I have published research papers on compression and information theory. Pardon me for being blunt. It bothers me when people who state something correctly are criticized by people who are obviously not at all expert on the matter.

Huffman coding, when it does work, works because the distribution of character counts is not random. Urban Ranger et al.'s criticism is not warranted.

In fact, in the context of compression and information theory, “random” is defined to mean “not compressible”. So if any (lossless) method of any form can compress a large set of data, then by definition that data is not random.

This stuff all dates back to Church (who first proposed the idea) and Turing (who gave the first practical formalism).

I was using a definition of “random” that is mathematically equivalent to ftg’s (though less circular-sounding). I assume Sam Stone was too. I would like to point out that a completely random sequence of bits is, in general, incompressible. (Is that right? I’m not an expert at all.)

More on Huffman coding: If you’re not using a standard frequency distribution (such as the letter frequencies in English words), then you have to include your mapping as part of your compressed message. By the time your message is large enough that this header size is insignificant, you’ve reached the size where a truly random message will have a frequency distribution close enough to uniform that Huffman coding is useless. Some messages will be shrunk, but some will be expanded, and on average, you won’t help the size at all.

The key to practical compression is that you hardly ever want to compress a random message. If, for instance, you want to compress your five thousand page novel, a Huffman code will do the job, since your novel is in English, and English has uneven letter frequencies. You can probably do even better by going to the word or fragment level, instead of the letter level (the word “the” is more common than the letter “q”, for instance). If you want to compress an image, then you’ve probably got either smooth gradients, in which case JPEG will work well, or large uniform blocks, in which case GIF will work well. About the only time you’ll see files on your computer sufficiently “random” to be uncompressible will be the output of your compression programs: Running a zip file through WinZip again won’t really make much difference.
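That last point is easy to check with Python’s zlib, which implements the same DEFLATE compression used inside .zip files: repetitive English-like text shrinks a lot, while random bytes and already-compressed data don’t.

import os
import zlib

text = (b"George had a father named George. George's father's father was also "
        b"named George. ") * 50                 # repetitive English text
noise = os.urandom(len(text))                   # random bytes of the same length

packed = zlib.compress(text)
print(len(text), "->", len(packed))             # shrinks to a small fraction of the original
print(len(noise), "->", len(zlib.compress(noise)))    # comes out slightly LARGER than the input
print(len(packed), "->", len(zlib.compress(packed)))  # compressing the compressed data gains nothing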

One of my favorite corollaries to compression and randomness is the converse of your statement: if you have data that is compressed with an efficient algorithm, it has the outward appearance of being completely random.

As long as you work to skip any ‘headers’ your compression program might add within your file, I should think that .ZIP files would make an effective one-time pad for secret messages.

As in: “Here’s a big data file of seemingly random numbers. To decrypt it, WINZIP disk 3 of Microsoft Excel for Windows, 1.03 from 1988 and XOR the bytes against the zip file.”

AmbushBug
i have the feeling i’ve just exported munitions using the SDMB. shudder
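Whether a zip file’s bytes really are random enough to serve as a one-time pad is exactly the question being debated above, but the XOR mechanics being described would look something like this (the file name here is just a hypothetical stand-in for whatever zip file both parties agree on):

def xor_with_key(data, key):
    """XOR each byte of the message against the corresponding byte of the key."""
    if len(key) < len(data):
        raise ValueError("the key file must be at least as long as the message")
    return bytes(b ^ k for b, k in zip(data, key))

message = b"meet me at the north pole"
key = open("excel_disk3.zip", "rb").read()      # hypothetical zip file both sides possess
scrambled = xor_with_key(message, key)
assert xor_with_key(scrambled, key) == message  # XORing twice with the same key recovers the message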

A few months ago, a company claimed to be able to compress “near-random,” practical data by a 100:1 factor, and that you could repeat the process to get even more compression. They made a big press splash around the first of this year, and their press release was picked up by several credulous news agencies, including Reuters (!).

Knowing that it was either a fraud or a joke, I kept my eye on them, checking their web site occasionally for updates. Recently when I looked, they were no longer there! The company name was ZeoSync, and their web site was www.zeosync.com. Unfortunately, their web page was completely Flash, so Google’s cache feature won’t let you view it. You can still search for ZeoSync and find lots of articles, for example http://www.pcworld.com/news/article/0,aid,86986,00.asp .

They even went so far as to say that their work specifically refuted the work of Claude Shannon. I’d really like to know the rest of the story.

ZeoSync was discussed here in GQ. The consensus of those in the know was that it was a bunch of crap.