What is the limit of applying compression algorithms?
I mean, if algorithm A reduces file size by 10%, and algorithm B by 3%, could these algorithms be applied alternately to a file to keep reducing its size indefinitely?
Nope.
Different algorithms have different ways of compressing, and therefore can compress by differing amounts. But in general, applying the same algorithm twice doesn’t produce any significant shrinkage, and in a lot of cases it actually makes the file bigger (an important fact about compression). If you run a file thru algorithm A and it compresses 10%, then thru B and B compresses it further by 1%, then algorithm A wasn’t that good in the first place. It is quite likely that if you had just run algorithm B alone you would have done better.
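To make that concrete, here’s a rough sketch in Python (standard library only; “example.txt” is just a placeholder for whatever file you test with) showing algorithm A applied twice versus algorithm B applied once:

[code]
# Minimal sketch: compressing already-compressed data buys little or nothing.
# "example.txt" is a placeholder; use any reasonably large text file.
import bz2
import zlib

data = open("example.txt", "rb").read()

once_a   = zlib.compress(data, 9)     # "algorithm A"
twice_a  = zlib.compress(once_a, 9)   # A applied to its own output
a_then_b = bz2.compress(once_a, 9)    # A first, then B
b_alone  = bz2.compress(data, 9)      # "algorithm B" on the original

print(len(data), len(once_a), len(twice_a), len(a_then_b), len(b_alone))
# Typically len(twice_a) >= len(once_a): zlib's output looks nearly random
# and the second pass just adds its own header. bz2 run on the raw text
# will often beat anything you get by stacking the two.
[/code]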
Read the sci.compression FAQ for more info.
If you have a burning desire to be well known as the Internet’s Biggest Idiot, post to sci.compression that you have an amazing method that can compress any data without loss, repeatedly.
Assuming non-lossy compression schemes, and assuming you wish to eventually recreate the original data, think about it for a minute. If you could re-apply compression schemes to the same data, and each time your output was smaller, you could eventually reduce any block of data to a single byte. But why stop there? Why not a single bit?
Therefore, logically, a single byte or bit cannot represent ALL possible large data sets and this scheme will not work.
In a practical sense, yes, you can re-apply compression algorithms, but the amount of space saved becomes less and less. Eventually, it may expand, and the processing time may become an unwelcome factor anyway.
Such are the laws of the universe. There ain’t no free lunch, as they say.
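If you want to watch the diminishing returns for yourself, a quick loop like this tells the story (again, “example.txt” is just a stand-in for any test file):

[code]
# Re-apply the same compressor and watch the savings vanish.
import zlib

buf = open("example.txt", "rb").read()  # placeholder input file
print(f"original: {len(buf)} bytes")
for i in range(5):
    buf = zlib.compress(buf, 9)
    print(f"pass {i + 1}: {len(buf)} bytes")
# The first pass shrinks the file; later passes hover around the same size
# or grow slightly, since each one adds its own overhead to data that is
# already close to incompressible.
[/code]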
No. Compression algorithms take advantage of order and redundancy in the source data to produce an output that uses less space. Statistically, the output has lost most of the order and redundancy that were present in the source data. It looks more like random noise. A compression algorithm will not reduce the size of a file containing random noise.
There is no such thing as a compression algorithm that will always reduce the size of a file. This can be proven mathematically. Compression algorithms are designed to work on the subset of files that share certain statistical properties.
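You can see that happen with a couple of lines of Python (standard library only), feeding a compressor nothing but random bytes:

[code]
# Random bytes have no order or redundancy to exploit, so a lossless
# compressor ends up making them slightly larger.
import os
import zlib

noise = os.urandom(1_000_000)       # 1 MB of random bytes
packed = zlib.compress(noise, 9)
print(len(noise), len(packed))      # packed comes out a bit bigger than noise
[/code]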
No. Otherwise, you would be able to reduce a large file to one bit.
In more detail, compression essentially reduces the repetition in the file. It finds symbols that can stand in for longer sequences of data, and then emits a file with a dictionary of symbols followed by a sequence of those symbols corresponding to the original information. Therefore, if the first algorithm did its job, there (ideally) shouldn’t be any repetition in the data anymore, making further compression pointless.
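Here’s a toy illustration of that idea in Python. It is not a real compression algorithm, just one made-up phrase swapped for a stand-in symbol, but it shows why a second pass has nothing left to work with:

[code]
# Toy dictionary substitution: replace a repeated phrase with a short symbol
# and ship the tiny "dictionary" alongside so the original can be rebuilt.
PHRASE = "the quick brown fox "
SYMBOL = "\x01"  # assumed never to occur in the text

def toy_encode(text: str) -> tuple[dict, str]:
    dictionary = {SYMBOL: PHRASE}
    return dictionary, text.replace(PHRASE, SYMBOL)

def toy_decode(dictionary: dict, encoded: str) -> str:
    for symbol, phrase in dictionary.items():
        encoded = encoded.replace(symbol, phrase)
    return encoded

original = PHRASE * 1000
dictionary, encoded = toy_encode(original)
assert toy_decode(dictionary, encoded) == original
print(len(original), len(encoded))   # 20000 vs 1000: the repetition is gone
# Encode the output again and nothing shrinks further; there is no
# occurrence of PHRASE left to replace.
[/code]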
The methods of finding those symbols in the most efficient way and representing them in the most compact form are, to some extent, still being developed, with new and different algorithms coming out every so often. For example, the traditional *nix compress utility has been replaced by the gzip program, and gzip is slowly being replaced by the more efficient bzip2.
Plus, compression is an interesting problem that can provide grist for endless computer science courses and a corresponding amount of research papers and testbed implementations and other such products of academia. It’s so interesting because the basis of the theory, the idea of finding patterns in arbitrary data, is so close to what the human mind does all the time. In other words, we are all compressing our experiences into ideas, words, and conceptual frameworks, because if we could not, we would be swamped by an excess of input.
The Wikipedia article on Data Compression is a good jumping-off point. But try Google and other sources of information; there’s a lot of info out there.
(Hm. I wonder if it’s possible to compress simulposts. ;))
Not really. The performance of a particular compression algorithm depends on the type of data you’re dealing with, and the intended results of the compression. Text data can be compressed losslessly pretty well, because there are lots of elements and patterns that are repeated. You can substitute them with small replacement identifiers and change them back later. This type of compression does not work for data with more randomness.
There’s also the question of lossy versus lossless compression. When compressing a book or a piece of scientific data, you don’t want to lose anything. Every letter or number is important. When compressing images or sounds, however, some loss of fine detail may be acceptable in exchange for the performance benefit. That is how formats like JPG and MP3 compress images and sounds from their raw formats. Once that data is lost, it can never be re-created. (Unless you keep the original, of course.)
So, in short, different types of compression are designed for different problems, and how they perform depends on whether you’re using them for the right job.
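A small sketch of the difference, if it helps (the “samples” and the divide-by-4 quantizer are made up purely for illustration):

[code]
# A lossless round trip gives back the exact bytes; a crude lossy step throws
# detail away for a smaller result that can never be perfectly restored.
import zlib

text = b"Every letter and number matters here. " * 100
assert zlib.decompress(zlib.compress(text)) == text   # lossless: exact round trip

samples = [12, 13, 14, 15, 200, 201, 202, 203]        # pretend audio samples
lossy = [s // 4 for s in samples]                     # quantize: keep less detail
restored = [q * 4 for q in lossy]                     # close, but not equal
print(samples)
print(restored)                                       # the fine detail is gone for good
[/code]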
Damn, that was a helluva simulpost.
I don’t remember the math from university, but there is a limit to how much something can be compressed. I will try to give an example as best I can.
If we could apply A, then B, then A, then B, and so on, and get more compression each time, we could eventually get a file that is exactly 1 byte long. The problem here is that a byte has only 256 values. As soon as we have compressed 257 different files down to 1 byte, we will have at least 2 files that compressed to the same value. When more than one file compresses to the same value, how are we to know what that value is supposed to expand to?
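If it helps, here’s the same pigeonhole argument as a few lines of Python (the inputs and the hash-based mapping are just placeholders for “any 257 files” and “any compressor”):

[code]
# 257 distinct files cannot map to 257 distinct single-byte values,
# because a byte has only 256 possible values.
files = [f"file-{i}".encode() for i in range(257)]   # 257 distinct inputs
codes = [hash(f) % 256 for f in files]               # any mapping down to one byte
assert len(set(codes)) <= 256 < len(files)           # so a collision is unavoidable
# Whichever two files share a code, a decompressor handed that one byte
# cannot know which of the two to give back.
[/code]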
I hope that helps. Otherwise I am sure someone will come along to explain it better than I can.
I think this is one of the questions you can be sure will draw out all of the geeks on a given message board.
Hey I resembled that remark.
I think I’d have to see some evidence of testing before I made my mind up.
Evidence of what? That you cannot reduce a file to one bit and then successfully regenerate it? That sometimes a compression program will increase the size of a file?
This sentence desperately needs context.
About a year or so ago (maybe two years), there was a guy making bold claims about a new compression methodology that would surprise everyone because it could compress anything at all much more efficiently than anything we currently have.
Of course I know this is bullshit, but I’m curious to know exactly when, how and where his scheme fizzled out - was he just baiting investors?
There’s no need for evidence, cause we have proof.
Think of it this way: a file is nothing more than a sequence of n bits, and there are 2[sup]n[/sup] possible files of length n. There are 1 + 2 + 4 + … + 2[sup]n - 1[/sup] files of length less than n, which adds up to 2[sup]n[/sup] - 1. So there are fewer files of length less than n than there are of length n. Furthermore, roughly 75% of those files are at most 2 bits shorter than n.
Any lossless compression algorithm (i.e., one that allows you to completely extract the original information) needs to map distinct inputs to distinct outputs, so it’s going to map 75% of all its inputs to files that are basically the same length. Since there aren’t enough shorter files, it also has to leave at least one file at at least the same size.
Now, that’s ignoring the issue of compression headers, which tell you how to uncompress data. They take up space as well, and are usually more than 2 bytes. So even with an ideal compression algorithm, 75% of all inputs will get larger.
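You can sanity-check the counting for small n with a few lines of Python:

[code]
# There are 2**n files of exactly n bits, only 2**n - 1 files of any shorter
# length, and roughly 75% of those shorter files are within 2 bits of n.
for n in range(3, 11):
    exactly_n = 2**n
    shorter = sum(2**k for k in range(n))       # 1 + 2 + ... + 2**(n-1) = 2**n - 1
    nearly_n = 2**(n - 1) + 2**(n - 2)          # files of length n-1 or n-2
    print(n, exactly_n, shorter, round(nearly_n / shorter, 3))
[/code]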
Real compression algorithms are specialized–they work really well on one kind of data (text, sound, video), but not so hot on others.
You’re probably thinking of ZeoSync, although there are others.
While it is impossible to compress truly random data losslessly, if you’re willing to accept loss there is no limit to what can be done!
That’s the fella! - actually, on reading it now, I can’t believe anyone even took it seriously for a moment; it’s very obviously insincere technobabble.
Well, not really: The most-used algorithms are generalists, because you want to standardize on one as much as you can to simplify things. And the compression algorithms usually used on sound, images, and video are lossy, which would be pretty stupid to apply to text or compiled programs or the like.
So there might be an algorithm optimized for text, or optimized for the machine code of a specific chip (for a real example of this, consider the Thumb opcodes for ARM chips), but they won’t gain much marketshare because people really don’t like to switch algorithms all the time.
Well, there’s always the famous compression sort transform, which can compress an arbitrary bitstream to ridiculously small files.
(The only problem is that there’s no known de-compressor…;))
Ultrafilter:
I don’t agree that this constitutes proof, although I’m sure you have a point.
Obviously we have established that files can’t be infinitely compressed, but I don’t think we’ve established exactly why.
It is well known from information theory that the expected number of bits that you will need to exactly represent a file is related to the amount of randomness in the source (i.e., the entropy). If you want a formal proof of this, it is known as the “source coding theorem” by Shannon. The technical proof is detailed, but not terribly difficult. I would reproduce it here, but I don’t know how to do the fancy math notation on SDMB. Check out Cover & Thomas pg 86 or Blahut pg 51 (both classic information theory texts).
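For a hands-on feel for the theorem, here’s a rough sketch (standard library only; “example.txt” is a placeholder, and this is only a zero-order estimate over single bytes, not the full entropy rate of the source):

[code]
# Estimate the zero-order byte entropy of a file and compare it with what
# a real compressor achieves.
import math
import zlib
from collections import Counter

data = open("example.txt", "rb").read()   # placeholder input file
counts = Counter(data)
n = len(data)
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())   # bits per byte

print(f"entropy estimate: {entropy:.2f} bits/byte")
print(f"zlib output     : {8 * len(zlib.compress(data, 9)) / n:.2f} bits/byte")
# A good compressor lands near this estimate, and often below it, because
# it also models patterns longer than a single byte.
[/code]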