View Full Version : Can compression algorithms be applied successively?
Scarf-Ace
09-20-2004, 03:53 PM
What is the limit of applying compression algorithms?
I mean, if algorithm A reduces file size by 10%, and algorithm B by 3%, could these algorithms be applied alternately to a file to keep reducing its size indefinitely?
Nope.
Different algorithms have different ways of compressing, and therefore can compress by differing amounts. But in general, applying the same algorithm twice doesn't produce any significant shrinkage in most cases (and actually makes the file bigger in a lot more cases--an important fact about compression). If your run a file thru algorithm A and it compresses 10%, and thru B and B compresses it further by 1%, then algorithm A wasn't that good in the first place. It is quite likely that you if just ran algorithm B alone you would have done better.
Read the sci.compression FAQ (http://www.faqs.org/faqs/compression-faq/part1/preamble.html) for more info.
If you have a burning desire to be well known as the Internet's Biggest Idiot, post to sci.compression that you have an amazing method to compress any data without loss repeated.
Musicat
09-20-2004, 04:05 PM
Asusming non-lossy compression schemes, and assuming you wish to eventually recreate the original data, think about it for a minute. If you could re-apply compression schemes to the same data, and each time your output was smaller, you could eventually reduce any block of data to a single byte. But why stop there? Why not a single bit?
Therefore, logically, a single byte or bit cannot represent ALL possible large data sets and this scheme will not work.
In a practical sense, yes, you can re-apply compression algorithms, but the amount of space saved becomes less and less. Eventually, it may expand, and the processing time may become an unwelcome factor anyway.
Such are the laws of the universe. There ain't no free lunch, as they say.
mks57
09-20-2004, 04:09 PM
What is the limit of applying compression algorithms?
I mean, if algorithm A reduces file size by 10%, and algorithm B by 3%, could these algorithms be applied alternately to a file to keep reducing its size indefinitely?
No. Compression algorithms take advantage of order and redundancy in the source data to produce an output that uses less space. Statistically, the output has lost most of the order and redundancy that were present in the source data. It looks more like random noise. A compression algorithm will not reduce the size of a file containing random noise.
There is no such thing as a compression algorithm that will always reduce the size of a file. This can be proven mathematically. Compression algorithms are designed to work on the subset of files that share certain statistical properties.
Derleth
09-20-2004, 04:10 PM
No. Otherwise, you would be able to reduce a large file to one bit.
In more detail, compression essentially reduces the repetition in the file. It finds symbols that can stand in for longer sequences of data, and the emits a file with a dictionary of symbols and then a sequence of those symbols corresponding to the original information. Therefore, if the first algorithm did its job, there (ideally) shouldn't be any repitition in the data anymore, making further compression pointless.
The methods of finding those symbols in the most efficient way and representing them in the most compact form are, to some extent, still being developed, with new and different algorithms coming out every so often. For example, the traditional *nix compress utility has been replaced by the gzip program, and gzip is slowly being replaced by the more efficient bzip2.
Plus, compression is an interesting problem that can provide grist for endless computer science courses and a corresponding amount of research papers and testbed implementations and other such products of academia. It's so interesting because the basis of the theory, the idea of finding patterns in arbitrary data, is so close to what the human mind does all the time. In other words, we are all compressing our experiences into ideas, words, and conceptual frameworks, because if we could not, we would be swamped by an excess of input.
The Wikipedia article on Data Compression is a good jumping-off point. (http://en.wikipedia.org/wiki/Data_compression) But try Google and other sources of information, there's a lot of info out there.
(Hm. I wonder if it's possible to compress simulposts. ;))
friedo
09-20-2004, 04:11 PM
Not really. The performance of a particular compression algorithm depends on the type of data you're dealing with, and the intended results of the compression. Text data can be compressed losslessly pretty well, because there are lots of elements and patterns that are repeated. You can substitute them with small replacement identifiers and change them back later. This type of compression does not work for data with more randomness.
There's also the question of lossy versus lossless compression. When compressing a book or piece of scientific data, you don't want to lose everything. Every letter or number is important. When compressing images or sounds, however, some loss of fine detail may be acceptable in exchange for the performance benefit. That is how formats like JPG, and MP3 compress images and sounds from their raw formats. Once that data is lost, it can never be re-created. (Unless you keep the original, of course.)
So, in short, different types of compression are desinged for different problems, and how they perform depends on whether you're using them for the right job.
friedo
09-20-2004, 04:12 PM
Damn, that was a helluva simulpost. :cool:
jhinman
09-20-2004, 04:19 PM
What is the limit of applying compression algorithms?
I mean, if algorithm A reduces file size by 10%, and algorithm B by 3%, could these algorithms be applied alternately to a file to keep reducing its size indefinitely?
I don't remember the math form university but there is a limit to how much something can be compressed. I will try to give an example as best I can.
If we could apply A Then B then A then B then A and so on and get more compression each time we could then get a file that is exactly 1 byte long. The problem here is a byte has only 256 values. As soon as we have compress 257 different files down to 1 byte we will have at least 2 files that compressed to the same value. When more then 1 file compress to the same value how are we to know what that value is suppose to expand too?
I hope that helps. Other wise I am sure some one will come along to explaine it better then I can.
Derleth
09-20-2004, 04:21 PM
Damn, that was a helluva simulpost. :cool:I think this is one of the questions you can be sure will draw out all of the geeks on a given message board. :D
jhinman
09-20-2004, 04:22 PM
I think this is one of the questions you can be sure will draw out all of the geeks on a given message board. :D
Hey I resembled that remark. :)
Scarf-Ace
09-20-2004, 04:28 PM
I think I'd have to see some evidence of testing before I made my mind up.
Derleth
09-20-2004, 04:34 PM
I think I'd have to see some evidence of testing before I made my mind up.Evidence of what? That you cannot reduce a file to one bit and the successfully regenerate it? That sometimes a compression program will increase the size of a file?
This sentence desperately needs context.
Mangetout
09-20-2004, 04:54 PM
About a year or so ago (maybe two years), there was a guy making bold claims about a new compression methodology that would surprise everyone because it could compress anything at all much more efficiently than anything we currently have.
Of course I know this is bullshit, but I'm curious to know exactly when, how and where his scheme fizzled out - was he just baiting investors?
ultrafilter
09-20-2004, 06:34 PM
I think I'd have to see some evidence of testing before I made my mind up.
There's no need for evidence, cause we have proof.
Think of it this way: a file is nothing more than a sequence of n bits, and there are 2n possible files of length n. There are 1 + 2 + 4 + ... + 2n - 1 files of length less than n, which adds up to 2n - 1. So there are fewer files of length less than n than there are of length n. Furthermore, roughly 75% of those files are at most 2 bits shorter than n.
Any lossless compression algorithm (i.e., one that allows you to completely extract the original information) needs to map distinct inputs to distinct outputs, so it's going to map 75% of all its inputs to files that are basically the same length. Since there aren't enough shorter files, it also has to leave at least one file at at least the same size.
Now, that's ignoring the issue of compression headers, which tell you how to uncompress data. They take up space as well, and are usually more than 2 bytes. So even with an ideal compression algorithm, 75% of all inputs will get larger.
Real compression algorithms are specialized--they work really well on one kind of data (text, sound, video), but not so hot on others.
yoyodyne
09-20-2004, 06:42 PM
About a year or so ago (maybe two years), there was a guy making bold claims about a new compression methodology that would surprise everyone because it could compress anything at all much more efficiently than anything we currently have.
Of course I know this is bullshit, but I'm curious to know exactly when, how and where his scheme fizzled out - was he just baiting investors?
You're probably thinking of ZeoSync (http://www.wired.com/news/technology/0,1282,49599,00.html), although there are others (http://datacompression.info/IncredibleClaims.shtml).
While it is impossible to compress truely random data losslessly, if you're willing to accept loss there is no limit (http://lzip.sourceforge.net/faq.html) to what can be done!
Mangetout
09-20-2004, 06:50 PM
You're probably thinking of ZeoSync (http://www.wired.com/news/technology/0,1282,49599,00.html)That's the fella! - actually, on reading it now, I can't believe anyone even took it seriously for a moment; it's very obviously insincere technobabble.
Derleth
09-20-2004, 08:29 PM
Real compression algorithms are specialized--they work really well on one kind of data (text, sound, video), but not so hot on others.Well, not really: The most-used algorithms are generalists, because you want to standardize on one as much as you can to simplify things. And the compression algorithms usually used on sound, images, and video are lossy, which would be pretty stupid to apply to text or compiled programs or the like.
So there might be an algorithm optimized for text, or optimized for the machine code of a specific chip (for a real example of this, consider the Thumb opcodes for ARM chips (http://www.embedded.com/showArticle.jhtml;jsessionid=00HOBQUAUSZTCQSNDBCSKHQ?articleID=15200241)), but they won't gain much marketshare because people really don't like to switch algorithms all the time.
Popup
09-21-2004, 05:23 AM
Well, there's always the famous compression sort transform (http://www.pdv-systeme.de/users/martinv/pgg/02U/02U021.html), which can compress an arbitrary bitstream to ridiculously small files.
(The only problem is that there's no known de-compressor...;))
Scarf-Ace
09-21-2004, 09:17 AM
Ultrafilter:
There's no need for evidence, cause we have proof.
Think of it this way: a file is nothing more than a sequence of n bits, and there are 2n possible files of length n. There are 1 + 2 + 4 + ... + 2n - 1 files of length less than n, which adds up to 2n - 1. So there are fewer files of length less than n than there are of length n. Furthermore, roughly 75% of those files are at most 2 bits shorter than n.
Any lossless compression algorithm (i.e., one that allows you to completely extract the original information) needs to map distinct inputs to distinct outputs, so it's going to map 75% of all its inputs to files that are basically the same length. Since there aren't enough shorter files, it also has to leave at least one file at at least the same size.
Now, that's ignoring the issue of compression headers, which tell you how to uncompress data. They take up space as well, and are usually more than 2 bytes. So even with an ideal compression algorithm, 75% of all inputs will get larger.
Real compression algorithms are specialized--they work really well on one kind of data (text, sound, video), but not so hot on others.
I don't agree that this constitutes proof, although i'm sure you have a point.
Obviously we have established that files can't be infinitely compressed, but I don't think we've established exactly why.
crozell
09-21-2004, 09:35 AM
Ultrafilter:
I don't agree that this constitutes proof, although i'm sure you have a point.
Obviously we have established that files can't be infinitely compressed, but I don't think we've established exactly why.
It is well known from information theory that the expected number of bits that you will need to exactly represent a file is related to the amount of randomness in the source (i.e., the entropy). If you want a formal proof of this, it is known as the "source coding theorem" by Shannon. The technical proof is detailed, but not terribly difficult. I would reproduce it here, but I don't know how to do the fancy math notation on SDMB. Check out Cover & Thomas pg 86 or Blahut pg 51 (both classic information theory texts).
Cheesesteak
09-21-2004, 10:01 AM
Let me take a quick stab at a non-tech answer to why you can't compress infinitely. There is a certain amount of information in each file, you cannot compress past the point where that information is defined at maximum efficiency.
A good comprssion algorithm will bring a bloated set of information and alter it so that the information contained within is more efficiently packed. Kind of like re-arranging your kitchen shelves, there might be a better way to pack your pantry, so you don't have as much unused space, but at some point every box is nested against it's neighbors, and you can't make it any smaller.
Mangetout
09-21-2004, 10:30 AM
In layman's terms, compression is nothing more than a fancy way of describing the data and since the data can already contain any of the possible values in the same set of symbols that will be used to describe the data (that is, unique combinations of 1 and 0), for any given algorithm, there will be a set of data that is more compact than the symbolic description.
For example, if the entire available set of symbols was the lower case letters a to z, we might analyse the data and notice that the string 'xyz' appears several times; in order to save space, we decide to replace all instances of 'xyz' with 'a' (knowing that we can reverse the process at the decompression stage).
Problem is that we might be given a set of data that does actually contain native instances of 'a' (which would be erroneously decompressed to 'xyz'), so we would have to use a different symbol to represent 'a' (and whatever symbol we choose could natively appear in another data set, and so on).
With adaptive compression routines (where the symbology is tailored so that the most common sequences are found and only these are replaced by more compact symbols, the data must then be accompanied by a 'dictionary' describing which sequences have been compressed and by what they are now represented and there will be sets of data that, when packed together with their dictionary, a re larger than they were to begin with.
Even if the algorithm is smart enough to recognise the uncompressible sequences and just include them verbatim, there's still the overhead of control codes to tell the decompression routine 'I couldn't compress this, so don't try to decompress it', resulting in a file bigger than the uncompressed data'.
If it were the case that the available character set contained a number of codes that were guaranteed never to appear in real-world data (for example by using a spare bit - always 9 bit architecture to store only 8 bit data), then everything would be highly compressible, but at the expense of having been far bigger than necessary in the first place.
Bytegeist
09-21-2004, 10:41 AM
Obviously we have established that files can't be infinitely compressed, but I don't think we've established exactly why.
Then how have we established it, if we haven't established why?
But we have, really. The number of files shorter than length N is fewer than the number of files of length N, for any N. Therefore there aren't enough of those shorter files to uniquely represent a given file of length N (for any N whatsoever, remember). You run out of representations, somewhere along the line. Therefore a lossless compression algorithm cannot yield actual compression for every possible input file.
This is the essence of ultrafilter's post, which I've tried to boil down even further.
Then there's Derleth's observation:
Otherwise, you would be able to reduce a large file to one bit.
To elaborate a little: you'd be able to reduce any file whatsoever down to one bit. But how could a little ol' two-state bit reconstruct the grand infinity of all possible files?
CurtC
09-21-2004, 10:53 AM
Obviously we have established that files can't be infinitely compressed, but I don't think we've established exactly why.Ultrafilter provided that, but I'll try another explanation. Let's start with a file that's only two bytes long. Let's say you have a compresser that can compress every file of two bytes into something shorter.
There are 65,536 different possible two-byte files (16 bits, and 2^16 = 65536). If you can compress some of them into fewer bits, there won't be enough different combinations of files with fewer bits to represent 65536 different ones. Therefore your compresser can't compress all possible files - it can compress some, but others will have to be made larger.
Now that we've proven this for two-byte files, the same argument works for files of any length.
ultrafilter
09-21-2004, 11:00 AM
I don't agree that this constitutes proof, although i'm sure you have a point.
The information theorists disagree with you.
Mangetout
09-21-2004, 11:03 AM
Yeah, but they're probably just guessing or something - what do they know about it, really?
Chronos
09-21-2004, 03:20 PM
Actually, I know of a compression algorithm which meets your needs. It's not very efficient, and will usually not result in any decrease in the size of your input. But it's guaranteed to never increase the size of the input, and if it's applied repeatedly enough times, it can eventually compress any arbitrary input file to less than a kilobyte. Best of all, both the compression and decompression algorithms are known, and both are very simple.
TimeWinder
09-21-2004, 03:51 PM
Somewhat along the lines of what I suspect Chronos is suggesting, there's also the infinite compression algorithm: The actual algoritm differs for each possible input file, but is basically as follows.
File we want to be able to compress really really well is "X", where X represents some stream of bytes of any length whatsoever. We don't care so much about any other file.
Algorithm: If the input file is "X", then output a one-bit output file (just the bit "1" for example). Otherwise, output the input file (i.e. do nothing).
Viola! We have "perfect" compression, but only for that one file, all others are left alone.
Question hinting at underlying complexity and thus left as an excersise for the reader: What do we do if the input file happens to match the output file for "X"? (i.e. if the input file is just the bit "1"). Hint: answering this is a lot harder than it looks, and basically ends up being a restatement of what ultrafilter said.
Less hard question with almost the same answer: Given that we can find a "perfect" compression algorithm for any given file as above, why can't we achieve "universal" compression by merely somehow numbering all of the possible perfect compressors and representing each file as the number of it's compression algorithm? (Hint: how do we do the numbering?)
Even easier observation: There are conditions under which a variation of this algorithm would actually be very useful; so much so that computer programmers do it all the time without even thinking about it.
I've been thinking about this, and maybe instead of quoting Shannon and other information theorists, instead if we described how some common compression algorithms work in simple terms, then it will become obvious why you can't compress infinitely. I mean, once you've seen something Huffman coded, you understand right away why it wouldn't work a second time, assuming you understand the algorithm.
I'd do this myself, but it's been 13 years since my algorithms class, and I'm 300 miles away from my books.
ultrafilter
09-21-2004, 04:13 PM
I've been thinking about this, and maybe instead of quoting Shannon and other information theorists, instead if we described how some common compression algorithms work in simple terms, then it will become obvious why you can't compress infinitely. I mean, once you've seen something Huffman coded, you understand right away why it wouldn't work a second time, assuming you understand the algorithm.
I'd do this myself, but it's been 13 years since my algorithms class, and I'm 300 miles away from my books.
That only convinces people that the same algorithm won't work twice. It doesn't preclude the existence of bizarre algorithms.
Mangetout
09-21-2004, 04:26 PM
Less hard question with almost the same answer: Given that we can find a "perfect" compression algorithm for any given file as above, why can't we achieve "universal" compression by merely somehow numbering all of the possible perfect compressors and representing each file as the number of it's compression algorithm? (Hint: how do we do the numbering?)As I suspect you are implying; more space is used in the indexing and referencing than is saved, in some cases.
Mangetout
09-21-2004, 04:51 PM
Lets try simplifying the problem to the point of absurdity; compressing a sequence of four bits: there are sixteen possible combinations of four bits:
0000
0001
0010
0011
0100
0101
0111
1000
1001
1010
1011
1100
1101
1110
1111
In order to compress every possible combination, we have to find sixteen unique binary patters that are shorter than four bits, which isn't possible; with three bits you only have eight different ways to say something:
000
001
010
011
100
101
110
111
You might still be able to achieve compression for some data sets, if you use your eight different codes to describe the eight most common four-bit sequences (and this will work well if those sequences happen to predominate in the data set), but how do you encode the other eight possibilities? In fact, what you do is to use only seven codes and reserve one as a control prefix to tell the decompressor that we're dealing with one of the less common items, so you have to do something like:
000 = code for common 4 bit sequence (1)
001 = code for common 4 bit sequence (2)
010 = code for common 4 bit sequence (3)
011 = code for common 4 bit sequence (4)
100 = code for common 4 bit sequence (5)
101 = code for common 4 bit sequence (6)
110 = code for common 4 bit sequence (7)
111 = control prefix for less common sequences, so:
111000 = code for less common 4 bit sequence (8)
111001 = code for less common 4 bit sequence (9)
111010 = code for less common 4 bit sequence (10)
111011 = code for less common 4 bit sequence (11)
111100 = code for less common 4 bit sequence (12)
111101 = code for less common 4 bit sequence (13)
111110 = code for less common 4 bit sequence (14)
Oops! we've run out of combinations again, so
111111000 = code for less common 4 bit sequence (15)
111111001 = code for less common 4 bit sequence (16)
There are actually more elegant (or less wasteful) methods than above, but you get the general idea - if you want to be smart/adaptive and choose the shortest codes for the most common element each time, you have to find a way of telling the decompressor what to expect, which means creating a dictionary header, which takes up valuable room, so you can't even with that way -in describing a file where the distribution is even and no compression is possible (or rather, the gains and losses balance out to zero, or nearly so), you still have to include the dictionary (or yet another unique code to say 'no dictionary this time) so that the decompressor knows where the real data starts and so uncompressible files actually get slightly bigger.
A given size set of data provides the possibility of a calculable number of unique permutations (any of which could occur in the field), a smaller set of data does not have the necessary possible unique permutations to assign to every possible version of the larger set.
vBulletin® v3.7.3, Copyright ©2000-2013, Jelsoft Enterprises Ltd.