Compression Algorithms--I'm clearly missing something

Ok–lemme explain how I think compression algorithms work, then ask my question…and then someone can tell me why I’m wrong. I know my idea/question is flawed…I just don’t know why.

Let’s imagine you have a bmp file. It’s a bunch of 1s and 0s. When you zip or rar a file, you’re saying (and this is the baby talk version) “Take patterns of ones and zeros and convert them into (say) hexadecimal.” So 111111 would be “AAA”, 111110 would be “AAB” and so on. Then when you unzip, it does the reverse: AAA gets turned back into 111111 (and so on). So by turning every six digit group into a three digit group, you’re reducing the file size by half.

So…my question: Assuming this is at least correct in a baby-talk version, why can’t you take it further? Take every group where AAA is followed by AAB (AAAAAB) and convert it to “AAA” and so on.

It’s obvious that it won’t work, because eventually you could compress the Oxford English Dictionary into a single six (or however many) character string. So there’s clearly a flaw in my logic. Also, if you try to zip a Zip file, it rarely gets even marginally smaller and sometimes gets bigger. So it doesn’t work in the real world.

But…why not?


That’s not at all like how it works. Take a look at Huffman coding to get a feel for a pretty simple algorithm that works well on text.

ETA: LZW is another famous algorithm, but a bit more difficult to understand.


It’s more complex than your description.

The more you compress something, the more the resulting data is lacking patterns that can be exploited by compression (it’s more random), so you can only go so far.

No, that’s not how compression algorithms work. It doesn’t make sense to convert strings of binary digits into letters, because those letters must themselves be stored as binary digits. Hexadecimal notation is just a human-friendly format for looking at binary data. You can’t store things in hexadecimal; everything is ones and zeroes.

The most basic compression algorithm is Huffman coding, which works like this. Suppose you are compressing a text file that contains a lengthy English novel. It probably has the word “the” a lot of times. In binary (assuming we are using ASCII encoding) the word “the” amounts to the following 24 bits: 01110100 01101000 01100101. What you can do is put a little marker at the beginning of the file that assigns a special code, say 10000001, to stand in for the word “the.” As long as the space that it takes to store the information about this code is less than the space saved by converting every “the” to 10000001, thus saving 16 bits, you have compression.

Huffman coding works by assigning the shortest codes to the things that appear most frequently. There are diminishing returns, because in addition to the original text, you have to store a “dictionary” mapping each special code to what it stands for. This is how you recover the original text. So you can only take it so far. It will work for natural-language texts, which have a highly skewed distribution of word frequencies. It will not work for random data, since every symbol will appear with roughly equal frequency. Indeed, one of the ways to test a data stream for randomness is to try to compress it.
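To make that concrete, here’s a rough Python sketch of Huffman coding (just the idea, not how zip actually lays out its files; the sample sentence is made up):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Assign short bit codes to frequent symbols, long codes to rare ones."""
    freq = Counter(text)
    # Heap entries are (frequency, tiebreaker, tree); a tree is either a
    # single symbol or a (left, right) pair of subtrees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent trees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # degenerate one-symbol input
    walk(heap[0][2], "")
    return codes

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(len(encoded), "bits, versus", 8 * len(text), "bits of plain ASCII")
```

The resulting codes are prefix-free: no code is the start of another, which is what lets the decoder know where each symbol ends without any separators.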

This is an example of lossless compression, because the exact original data is recovered after decompression.

There are also lossy forms of compression, which take advantage of the limits of human perception. JPEG image files, for example, are compressed from the original images by sacrificing some image detail. The algorithm is designed to sacrifice detail in such a way that the important bits (to the human eye) will be preserved well enough so the image looks pretty close to the original. You can never recover the original data from a lossy-compressed file, because information is lost in the process. Lossy compression for audio works the same way. With lossy compression, you can generally tune input parameters to specify how far you want to take it. If you turn it up too high, the result is eventually of such poor quality that it is useless.

ultrafilter’s links can help you better understand some of the ways compression is achieved.

To answer your question simply, compression works by eliminating redundancy in the data. Once you’ve eliminated all, or nearly all, of the redundancy, there’s no way to compress it any further.

That’s the simple answer. It’s actually a little more complicated than that. Depending on the technique, you can reach a point of diminishing returns while there’s still some remaining redundancy.

If you think hexadecimal is fundamentally different from binary, that’s your confusion right there.

Compression is all about probabilities. For example, to represent the 256 different 8-bit bytes you need 256 code words; these are represented most simply with 8 bits … No Compression! But if one of the 8-bit bytes (say ‘7D’) occurs 50% of the time, you represent it with a single bit (say 0), represent the other 255 possibilities with 9 bits each (say 1, followed by the byte) for an average cost of 5 bits … 8:5 compression ratio!

In that example the input file was highly non-random (half its bytes were ‘7D’) but the output will appear much more random. Random == Incompressible.
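Here’s that arithmetic as a quick Python sketch (the byte values are just the ones from the example above):

```python
# The scheme from the example: byte 0x7D (half of all bytes) gets the
# 1-bit code "0"; any other byte b gets the 9-bit code "1" + bits of b.
def encode(data: bytes) -> str:
    out = []
    for b in data:
        if b == 0x7D:
            out.append("0")
        else:
            out.append("1" + format(b, "08b"))
    return "".join(out)

# A made-up file where exactly half the bytes are 0x7D:
data = bytes([0x7D, 0x41] * 1000)
bits = encode(data)
avg = len(bits) / len(data)
print(avg)  # 5.0 bits per byte, i.e. the 8:5 compression ratio
```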

The above discussion applies to essentially all compression schemes though the idea is often hidden behind high-level reversible transformations which serve to prepare the data for such simple probability-based coding. You can read about a simple yet impressive example of such a transformation sometimes used on text files by Googling for Burrows-Wheeler Transform.
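If you’re curious, here’s a naive Python sketch of that transform (real implementations use cleverer data structures; this just sorts all rotations, and assumes “$” doesn’t occur in the input):

```python
def bwt(text: str) -> str:
    """Naive Burrows-Wheeler Transform: sort all rotations of the string
    (with a unique end marker) and read off the last column."""
    s = text + "$"  # end-of-string marker, assumed absent from the input
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # "annb$aa" -- like characters get grouped together
```

The transform doesn’t compress anything by itself; it rearranges the data (reversibly) so that runs of identical characters appear, which simple probability-based coders then exploit.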

I was starting off on an explanation, but septimus has covered it pretty well. Compression consists of identifying patterns that can be encoded using fewer bits than the pattern uses. The encoded results tend to contain well distributed patterns that can’t easily be encoded using fewer bits. Common compression utilities often check for recompression based on the internal format of a compressed file or directory information, and don’t allow any actual recompression. But if you can find a compression program that allows recompression, or modify compressed files so they aren’t recognized as such, you’ll see that very little, if any, additional compression can be done after the first compression. Because the output contains additional internal format information, additional recompression may actually increase the size of the file.

True. This ties into the math behind compression, mostly in terms of the ‘pigeonhole principle’.

The pigeonhole principle states that if you have five boxes to put things into, one thing per box, you can’t stick six things into boxes.

In terms of compression, a fixed number of bits can only be put into so many specific arrangements (in specific, n bits can be arranged a total of 2[sup]n[/sup] ways, so 256 ways for 8 bits), so if someone claims to have a compression algorithm that can always compress any file, they’re a conman or an idiot because, for example, there are 4 possible 2-bit files (2[sup]2[/sup] = 4) but only 2 possible 1-bit files (2[sup]1[/sup] = 2). So, once you assign bit 0 to the file containing 00, and bit 1 to the file containing 01, what’s the file containing 10 going to be assigned to?

We can actually prove the pigeonhole principle mathematically, but, really, it is that simple. Yet there still are people who can’t quite grasp it.

So, the result is that there is no perfect compression algorithm: For any given algorithm, there are some files it can’t compress at all and will leave the same size, and, likely, some files it will actually make larger. (That said, essentially any piece of prose written in a human language can be compressed. Human languages are extremely redundant and can be compressed quite a bit with even primitive algorithms.)
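You can see both halves of that with a quick experiment using Python’s standard zlib library (the sample text is just made up for illustration):

```python
import os
import zlib

# English prose is redundant, so it compresses well:
prose = b"the quick brown fox jumps over the lazy dog " * 50
print(len(zlib.compress(prose)), "<", len(prose))

# Random bytes have no exploitable patterns; zlib's framing overhead
# typically makes the "compressed" output no smaller, and often larger:
noise = os.urandom(2000)
print(len(zlib.compress(noise)), ">=", len(noise))
```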

You know how, in Morse code, common letters like ‘E’ and ‘T’ have short codes, while uncommon letters like ‘X’ and ‘Z’ have long codes?

That’s a good scheme. It’s an efficient scheme. The encoded length of different messages is well-tailored to their frequencies.

If it was common for people to use some other, less-efficient scheme for encoding their messages as dots and dashes, you could improve upon it by translating into Morse code.

That’s compression.

Actually, the pigeonhole principle states merely that if you have six items and five boxes, one of the boxes must have multiple items. If that’s unacceptable for your application, then you need more boxes/fewer items. But the principle itself doesn’t make the claim that you “can’t” put six items in five boxes. Just that each item can’t have a unique box.

Isn’t there a popular compression algorithm that uses something like a tree? I remember learning about something a couple years back where you’d have a tree like

     /  \
   0      1
  /         \
 [A]         []
           0   1
          /      \
        [C]     [B]

Where the most common bit-strings (or in this case, letters) were on higher leaf nodes. So then the string “bac” (011000100110000101100011) would be represented as 11010.

Or is that just a simple implementation of Huffman coding? On review they seem pretty much the same.
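For illustration, here’s a sketch in Python of encoding and decoding with the codes read off a tree like that (assuming the leaves are A = 0, C = 10, B = 11):

```python
# Codes read off the tree: higher leaves get shorter codes.
codes = {"a": "0", "c": "10", "b": "11"}

def encode(text):
    return "".join(codes[ch] for ch in text)

def decode(bits):
    rev = {v: k for k, v in codes.items()}
    out, cur = [], ""
    for bit in bits:
        cur += bit
        if cur in rev:   # codes are prefix-free, so a match is unambiguous
            out.append(rev[cur])
            cur = ""
    return "".join(out)

print(encode("bac"))     # five bits instead of 24 bits of ASCII
print(decode(encode("bac")))
```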

This is without question the most important thing anyone who thinks you can just keep recompressing stuff needs to understand. If an algorithm can compress some files, then that intrinsically means that others have to get larger.

In general, non-random files (lots of repeating patterns, for example) compress well. Random files don’t compress. The output of compression sort of looks random, so trying to compress it again is not going to work well.

In fact, one way to define how non-random a string is, is how well it can be compressed. (Formalized by Kolmogorov if you want to Google for more.)

Do they have to? Couldn’t worst-case scenario for an algorithm be that it stays the same size, or are you counting the compression/decompression utility as part of the file size?

Yes, they have to (at least for lossless compression). In order to be lossless, the compression must be reversible, which means that every bit string must be mapped to a *unique* compressed bit string. I.e. you can’t have distinct bit strings X and Y compress to the same bit string Z, since if you then try to decompress Z, how would you know whether to give back X or Y?

But there are fewer possible short bit strings than there are possible long bit strings. So, since each long bit string “consumes” the bit string it compresses into, making it unavailable for any other bit string to compress into, only so many long bit strings can compress into short bit strings before you run out. The rest of the long bit strings have to “compress” to even longer bit strings.
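The counting behind this is easy to check in a couple of lines of Python:

```python
# There are 2**n bit strings of length n, but only 2**1 + ... + 2**(n-1)
# (which equals 2**n - 2) nonempty strings that are strictly shorter.
n = 8
longer = 2 ** n
shorter = sum(2 ** k for k in range(1, n))
print(longer, "strings of length", n, "but only", shorter, "shorter ones")
```

So no matter how you pair them up, some length-n strings are left with no shorter string to map to.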

Other people have given lots of good information, but though your premise is partially flawed, this is a good question that hasn’t been answered. The answer is that when decompressing the file, you need to know what to translate ‘AAA’ into - there can’t be two different sequences of data that would both translate to ‘AAA’ in the same place. (Note, this rule does not apply to ‘lossy’ compression, where the output is not the same as the input data but considered ‘close enough’ in human terms - such as MP3 music or JPG photos.)

I definitely recommend going through and building a Huffman tree based on text input, if you want to understand lossless compression algorithms better. If the OP or anyone else is interested, I can summarize some instructions and make some example problems right here. :smiley:

Not necessarily the additional size of the utility, but as a practical matter some data must be added to identify that the contents are in a compressed format, even if that format is unchanged from the original.

It can be done by convention, for example changing the extension of a file. But it is advantageous to have compressed data recognizable from its content. Without meta-data, any set of randomly selected data might end up being identical to the compressed result of another set of data. So even a signature format is insufficient in all cases to determine whether or not a set of data is compressed.

To have data that can be unambiguously identified as compressed, additional data must be added to the compressed results. If you don’t mind the ambiguity, worst case could be no change in size.

That just depends on whether the compression algorithm is content specific or not. So to use the Morse code example, the frequency of letters is pre-established, and the most frequently used letters are encoded in fewer bits. The compressed data doesn’t have to include any other information about how to decompress it, since the same rules are used for any message.

Hey–I wanted to thank everyone for the info–the specific question I had was answered in posts 6 and 8, but the other posts on why my basic understanding of how compression works were great as well. I’m still reading and appreciate all the help!


This example involves taking a 6-hex-character string and encoding it as a 3-hex-character string. There are a lot more 6-character strings than 3-character strings.

So, either

  1. You’re only using a small subset of the possible 6-character strings (small enough that there are few enough possibilities that each one can be encoded using just 3 characters). In this case, there’s a lot of duplicated data in your set and you probably can compress it well.
  2. You’re going to “overflow” your short string supply, and you’ll have to start using longer strings. Given the overhead required for compression, you eventually end up getting a larger file than you started with.

Point #2 is pretty easy to demonstrate with common compression algorithms. Take a file and zip it. Then zip that into another zip. Continue. At some point (probably around the 2nd or 3rd iteration), the files start getting larger.

If the above isn’t clear, let’s go back to binary. How would you encode all 2-character binary strings (00, 01, 10, 11) as 1-character strings (0, 1)? This is analogous to your question above. It’s pretty clear that you can’t do that, because there are 4 2-character strings and only 2 1-character strings.