Let me take a quick stab at a non-tech answer to why you can’t compress infinitely. There is a certain amount of information in each file, and you cannot compress past the point where that information is packed at maximum efficiency.
A good compression algorithm takes a bloated set of information and rearranges it so that the information contained within is more efficiently packed. It’s kind of like rearranging your kitchen shelves: there might be a better way to pack your pantry so you don’t have as much unused space, but at some point every box is nested against its neighbors, and you can’t make it any smaller.
In layman’s terms, compression is nothing more than a fancy way of describing the data. Since the data can already contain any of the possible values in the same set of symbols that will be used to describe it (that is, unique combinations of 1 and 0), for any given algorithm there will be some set of data that is already more compact than its symbolic description.
For example, if the entire available set of symbols was the lower case letters a to z, we might analyse the data and notice that the string ‘xyz’ appears several times; in order to save space, we decide to replace all instances of ‘xyz’ with ‘a’ (knowing that we can reverse the process at the decompression stage).
Problem is that we might be given a set of data that does actually contain native instances of ‘a’ (which would be erroneously decompressed to ‘xyz’), so we would have to use a different symbol to represent ‘a’ (and whatever symbol we choose could natively appear in another data set, and so on).
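To make that failure concrete, here’s a minimal Python sketch of the ‘xyz’ → ‘a’ scheme described above (the sample strings are made up for illustration):

```python
# A naive substitution "compressor": replace every 'xyz' with 'a'.
# It breaks on any input that already contains a native 'a'.

def compress(text):
    return text.replace("xyz", "a")

def decompress(text):
    return text.replace("a", "xyz")

ok = "pqrxyzpqrxyz"          # no native 'a', so the round trip works
bad = "banana"               # native 'a's get wrongly expanded to 'xyz'

print(compress(ok))                      # pqrapqra
print(decompress(compress(ok)) == ok)    # True
print(decompress(compress(bad)))         # bxyznxyznxyz  (not 'banana')
print(decompress(compress(bad)) == bad)  # False
```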
With adaptive compression routines (where the symbology is tailored so that the most common sequences are found and only these are replaced by more compact symbols), the data must then be accompanied by a ‘dictionary’ describing which sequences have been compressed and what they are now represented by, and there will be sets of data that, when packed together with their dictionary, are larger than they were to begin with.
Even if the algorithm is smart enough to recognise the uncompressible sequences and just include them verbatim, there’s still the overhead of control codes to tell the decompression routine ‘I couldn’t compress this, so don’t try to decompress it’, resulting in a file bigger than the uncompressed data.
If it were the case that the available character set contained a number of codes that were guaranteed never to appear in real-world data (for example, by using a spare bit: a 9-bit architecture that only ever stores 8-bit data), then everything would be highly compressible, but at the expense of having been far bigger than necessary in the first place.
Then how have we established it, if we haven’t established why?
But we have, really. The number of files shorter than length N is smaller than the number of files of length N, for any N: there are 2^N distinct files of N bits, but only 2^N - 1 files of fewer than N bits. Therefore there aren’t enough of those shorter files to uniquely represent a given file of length N (for any N whatsoever, remember). You run out of representations somewhere along the line. Therefore a lossless compression algorithm cannot yield actual compression for every possible input file.
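A quick sanity check of that count, in Python (just an illustration of the arithmetic):

```python
# For any length n (in bits), there are 2**n files of exactly n bits,
# but only 2**n - 1 files that are shorter (summing lengths 0..n-1).
# So a lossless compressor cannot map every n-bit file to a distinct
# shorter one.

for n in range(1, 9):
    exactly_n = 2 ** n
    shorter = sum(2 ** k for k in range(n))  # 2**0 + ... + 2**(n-1)
    print(n, exactly_n, shorter, shorter < exactly_n)
```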
This is the essence of ultrafilter’s post, which I’ve tried to boil down even further.
Then there’s Derleth’s observation:
To elaborate a little: you’d be able to reduce any file whatsoever down to one bit. But how could a little ol’ two-state bit reconstruct the grand infinity of all possible files?
Ultrafilter provided that, but I’ll try another explanation. Let’s start with a file that’s only two bytes long. Let’s say you have a compressor that can compress every two-byte file into something shorter.
There are 65,536 different possible two-byte files (16 bits, and 2^16 = 65536). If you compress them into fewer bits, there won’t be enough different combinations of shorter files to represent 65,536 different ones. Therefore your compressor can’t compress all possible files - it can compress some, but others will have to be made larger.
Now that we’ve proven this for two-byte files, the same argument works for files of any length.
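The same arithmetic for the two-byte case, as a one-off check (illustration only):

```python
# There are 65,536 sixteen-bit files, but only 65,535 possible files
# shorter than sixteen bits -- not enough to go around.
files = 2 ** 16
shorter = sum(2 ** k for k in range(16))  # lengths 0 through 15 bits
print(files, shorter)  # 65536 65535
```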
Actually, I know of a compression algorithm which meets your needs. It’s not very efficient, and will usually not result in any decrease in the size of your input. But it’s guaranteed to never increase the size of the input, and if it’s applied repeatedly enough times, it can eventually compress any arbitrary input file to less than a kilobyte. Best of all, both the compression and decompression algorithms are known, and both are very simple.
Somewhat along the lines of what I suspect Chronos is suggesting, there’s also the infinite compression algorithm: the actual algorithm differs for each possible input file, but is basically as follows.
The file we want to be able to compress really, really well is “X”, where X represents some stream of bytes of any length whatsoever. We don’t care so much about any other file.
Algorithm: If the input file is “X”, then output a one-bit output file (just the bit “1” for example). Otherwise, output the input file (i.e. do nothing).
Voilà! We have “perfect” compression, but only for that one file; all others are left alone.
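Here’s a minimal sketch of that scheme in Python (using a one-byte stand-in for the single bit, since real files are byte-granular, and a made-up value for X):

```python
# "Perfect" compression for exactly one file. X is whatever single file
# we care about (a made-up value here); everything else passes through.

X = b"the one file we care about"   # stands in for any byte stream whatsoever

def compress(data):
    return b"\x01" if data == X else data

def decompress(data):
    return X if data == b"\x01" else data

print(compress(X))                      # b'\x01'
print(decompress(compress(X)) == X)     # True
print(decompress(compress(b"other")))   # b'other'
# Note the lurking ambiguity: an input that is literally b'\x01' gets
# "decompressed" to X instead of itself -- exactly the question posed below.
```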
Question hinting at underlying complexity and thus left as an exercise for the reader: what do we do if the input file happens to match the output file for “X”? (i.e. if the input file is just the bit “1”). Hint: answering this is a lot harder than it looks, and basically ends up being a restatement of what ultrafilter said.
Less hard question with almost the same answer: given that we can find a “perfect” compression algorithm for any given file as above, why can’t we achieve “universal” compression by merely somehow numbering all of the possible perfect compressors and representing each file as the number of its compression algorithm? (Hint: how do we do the numbering?)
Even easier observation: There are conditions under which a variation of this algorithm would actually be very useful; so much so that computer programmers do it all the time without even thinking about it.
I’ve been thinking about this, and maybe instead of quoting Shannon and other information theorists, if we described how some common compression algorithms work in simple terms, it would become obvious why you can’t compress infinitely. I mean, once you’ve seen something Huffman coded, you understand right away why it wouldn’t work a second time, assuming you understand the algorithm.
I’d do this myself, but it’s been 13 years since my algorithms class, and I’m 300 miles away from my books.
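For what it’s worth, here is a minimal Huffman-coding sketch in Python (an illustration written from scratch, not any particular poster’s code; the sample sentence is made up). Once you see the output - an almost-random-looking bit string, plus the dictionary needed to decode it - it’s clear why running it a second time gains nothing:

```python
import heapq
from collections import Counter

def huffman_code(text):
    # One weighted leaf per symbol; repeatedly merge the two lightest
    # trees. The counter field breaks ties so tuples stay comparable.
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

text = "this sentence has a very uneven letter distribution"
code = huffman_code(text)
bits = "".join(code[ch] for ch in text)
print(code)
print(len(bits), "bits vs", 8 * len(text),
      "bits uncompressed (dictionary overhead not counted)")
```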
Let’s try simplifying the problem to the point of absurdity: compressing a sequence of four bits. There are sixteen possible combinations of four bits:
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
In order to compress every possible combination, we have to find sixteen unique binary patterns that are shorter than four bits, which isn’t possible; with three bits you only have eight different ways to say something:
000
001
010
011
100
101
110
111
You might still be able to achieve compression for some data sets if you use your eight different codes to describe the eight most common four-bit sequences (and this will work well if those sequences happen to predominate in the data set), but how do you encode the other eight possibilities? In fact, what you do is use only seven codes and reserve one as a control prefix to tell the decompressor that we’re dealing with one of the less common items, so you have to do something like:
000 = code for common 4 bit sequence (1)
001 = code for common 4 bit sequence (2)
010 = code for common 4 bit sequence (3)
011 = code for common 4 bit sequence (4)
100 = code for common 4 bit sequence (5)
101 = code for common 4 bit sequence (6)
110 = code for common 4 bit sequence (7)
111 = control prefix for less common sequences, so:
111000 = code for less common 4 bit sequence (8)
111001 = code for less common 4 bit sequence (9)
111010 = code for less common 4 bit sequence (10)
111011 = code for less common 4 bit sequence (11)
111100 = code for less common 4 bit sequence (12)
111101 = code for less common 4 bit sequence (13)
111110 = code for less common 4 bit sequence (14)
Oops! We’ve run out of combinations again, so:
111111000 = code for less common 4 bit sequence (15)
111111001 = code for less common 4 bit sequence (16)
There are actually more elegant (or less wasteful) methods than the above, but you get the general idea: if you want to be smart/adaptive and choose the shortest codes for the most common elements each time, you have to find a way of telling the decompressor what to expect, which means creating a dictionary header, which takes up valuable room. So you can’t win that way either - in describing a file where the distribution is even and no compression is possible (or rather, the gains and losses balance out to zero, or nearly so), you still have to include the dictionary (or yet another unique code to say ‘no dictionary this time’) so that the decompressor knows where the real data starts, and so uncompressible files actually get slightly bigger.
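Here’s a small Python sketch of that escape-prefix scheme, with a made-up frequency ranking for the sixteen sequences (illustration only; dictionary overhead not counted):

```python
# The seven most common 4-bit sequences get 3-bit codes 000-110; '111'
# escapes to the next seven; '111111' escapes again for the last two.

# All sixteen 4-bit sequences, most common first (hypothetical ranking).
ranked = ["0000", "1111", "0101", "1010", "0011", "1100", "0001",
          "1000", "0110", "1001", "0010", "0100", "1011", "1101",
          "0111", "1110"]

def build_codes(ranked):
    codes = {}
    for i, seq in enumerate(ranked):
        escapes, slot = divmod(i, 7)
        if escapes >= 2:                 # only two sequences remain after
            escapes, slot = 2, i - 14    # the second escape: slots 0 and 1
        codes[seq] = "111" * escapes + format(slot, "03b")
    return codes

codes = build_codes(ranked)
for seq in ranked:
    print(seq, "->", codes[seq])

data = "0000" * 5 + "1110"               # mostly common, one rare sequence
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
encoded = "".join(codes[c] for c in chunks)
print(len(encoded), "bits vs", len(data), "bits raw")  # 24 vs 24: no net win
```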
A data set of a given size admits a calculable number of unique permutations (any of which could occur in the field); a smaller data set simply does not have enough possible unique permutations to assign one to every possible version of the larger set.