I had posted a lengthy response in this thread early on, mostly in reply to Bytegeist’s “interesting” first post. But it went the way of the hamsters.
I dreaded typing it all back in, etc. But it appears the signal-to-noise ratio has finally recovered (ignoring the ternary hijack).
At this point I want to add just a couple of things:
Here’s the comp.compression FAQ. It explains a lot of the basics, makes fun of earlier “we can compress everything!” companies, etc. Some of the posters to this thread really need to read it.
While some compression methods tack on headers, not all do; many forms of RLE don't, for example, and you can do LZW without a header if you choose.
Even without headers, some files will get larger: a lossless compressor that makes some inputs smaller must make others bigger, by simple counting (there are more n-bit files than there are shorter outputs to map them to).
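To make both points concrete, here's a minimal headerless RLE in Python. It's a sketch, not anyone's production codec: the output is a bare stream of (count, byte) pairs with no header at all, so repetitive input shrinks dramatically while input with no runs doubles in size.

```python
# Minimal headerless run-length codec (a sketch, not a production tool).
# Output is a bare stream of (count, byte) pairs -- no header anywhere.
def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    for count, value in zip(data[::2], data[1::2]):
        out += bytes([value]) * count
    return bytes(out)

repetitive = b"A" * 1000
no_runs = bytes(range(256))             # no repeated bytes at all

assert rle_decode(rle_encode(repetitive)) == repetitive
print(len(rle_encode(repetitive)))      # 8   -- tiny, and no header needed
print(len(rle_encode(no_runs)))         # 512 -- double the original 256 bytes
```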
Note that you can't iterate a lossy compression system. If you feed a JPEG into a JPEG compressor (dressing the raw bytes up as an image first), the losses incurred will make recovery of the once-compressed image impossible. Every bit in the compressed file has become vital.
For things like JPEG compression, there is a quality setting (mistakenly taken as a "percentage" value by too many people). Set it to 75 and you get a nice approximation of the original. Set it to 20 and it's probably going to look crappy. Set it to 5 and you get a tiny file that is not going to look like the original. (Setting it to 100 means you don't have a clue.) For audio files, you can tweak the sample rates and ranges. Etc. But it's one pass.
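To see the quality knob and the one-pass limitation in action, here's a quick sketch using the Pillow imaging library (my choice of tool, not anything prescribed; any JPEG encoder behaves the same way, and the input file name is hypothetical):

```python
# A quick look at the JPEG quality knob and at generation loss, using the
# Pillow library (pip install Pillow). The input file name is hypothetical.
import io
from PIL import Image

img = Image.open("original.png").convert("RGB")

# The quality setting trades file size against fidelity -- it is not a
# percentage of anything.
for q in (100, 75, 20, 5):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=q)
    print(f"quality={q:3d}: {buf.tell():8d} bytes")

# Lossy compression is one pass: every decode/re-encode cycle can only add
# error, never recover it. 'gen' below is an approximation of an
# approximation of an approximation...
gen = img
for _ in range(10):
    buf = io.BytesIO()
    gen.save(buf, format="JPEG", quality=75)
    buf.seek(0)
    gen = Image.open(buf).convert("RGB")
```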
These kinds of companies have been popping up for a long time. A friend of mine in graduate school was brought in to consult for such a company. He knew it was a joke inside of 5 minutes. They were very unhappy with his analysis and adamant that they really could do it. That was the early '70s, folks.
Why does the press give free advertising for these clowns? There is no journalism in American Big Media anymore. No reporter actually questions anything they are told.
A real compression tool probably wouldn't use characters; it might treat the data as a continuous binary stream, or something else. The control 'character' in my example had to be the first one in the file (i.e., it is part of the header, not the data), which is why you need the Y or N. Also, I assumed (for the purposes of the example) that the entire data would be either compressed or uncompressed; if you want mixed segments in a single file, you'd have to do something like:
-Use segments of a fixed size (which means that the break points probably won't fall where you'd find them most convenient in terms of compressibility) - in this case, you won't achieve the best possible compression, simply because your compressible segments will contain uncompressible sequences and vice versa.
-Use a control 'character' that does not otherwise occur in your compressed data (this won't happen naturally, so you have to make it happen by encoding all of the compressed and uncompressed data in such a way that the control character can't occur accidentally) - in this case, you're essentially 'wasting' a little coding capacity at every step, since one symbol value is permanently reserved (see the byte-stuffing sketch below).
-Give the file a header that explicitly maps the size, position and compression methodology of each segment - probably the best solution all round, but the map takes up space even if the entire content of the file is one uncompressible sequence.
The advantage of the map method is that you can analyse each segment and use a different compression methodology for each one, as appropriate - this is essentially what the zip format does per archive entry: each entry records its own compression method, falling back to "stored" (no compression) when compressing doesn't help.
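Here's a toy version of that map approach in Python: split the input into fixed-size segments, try zlib's deflate on each, and keep whichever form is smaller. The field sizes and method codes are invented for illustration; real zip is far more elaborate, but it makes the same per-entry stored-vs-deflated choice.

```python
# Toy "header map" container: the header records one (method, stored-size)
# entry per fixed-size segment; bodies follow in order. Format is invented.
import os
import struct
import zlib

SEG = 4096  # fixed segment size; a real tool might tune or vary this

def pack(data: bytes) -> bytes:
    entries, bodies = [], []
    for i in range(0, len(data), SEG):
        seg = data[i:i + SEG]
        comp = zlib.compress(seg)
        if len(comp) < len(seg):
            entries.append((1, len(comp)))   # method 1 = deflated
            bodies.append(comp)
        else:
            entries.append((0, len(seg)))    # method 0 = stored as-is
            bodies.append(seg)
    header = struct.pack("<I", len(entries))
    for method, size in entries:
        header += struct.pack("<BI", method, size)
    return header + b"".join(bodies)

def unpack(blob: bytes) -> bytes:
    count = struct.unpack_from("<I", blob, 0)[0]
    pos = 4
    entries = []
    for _ in range(count):
        method, size = struct.unpack_from("<BI", blob, pos)
        entries.append((method, size))
        pos += 5
    out = bytearray()
    for method, size in entries:
        body = blob[pos:pos + size]
        out += zlib.decompress(body) if method else body
        pos += size
    return bytes(out)

# Compressible text followed by incompressible random bytes.
mixed = b"all work and no play " * 500 + os.urandom(4096)
assert unpack(pack(mixed)) == mixed
print(len(mixed), "->", len(pack(mixed)))
```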
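And for completeness, the reserved-control-character option from the list above: the classic trick is byte stuffing - double every literal occurrence of the escape byte so a lone one can never appear in the payload. A sketch, with an arbitrary choice of escape value:

```python
# Byte stuffing sketch: reserve one byte value as an escape. Doubling every
# literal occurrence guarantees a lone ESC never appears in the payload, so
# a single unescaped ESC can safely mark a mode switch. The value 0xFF is an
# arbitrary choice for illustration.
ESC = 0xFF

def stuff(payload: bytes) -> bytes:
    return payload.replace(bytes([ESC]), bytes([ESC, ESC]))

def unstuff(encoded: bytes) -> bytes:
    return encoded.replace(bytes([ESC, ESC]), bytes([ESC]))

data = bytes([0x41, 0xFF, 0x42])
assert unstuff(stuff(data)) == data
# The cost: every literal 0xFF now takes two bytes. That's the "wasted"
# coding capacity of reserving one symbol value out of 256 at every step.
```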