Whatever happened to Mega Data Compression

If the compression program isn’t smart enough to compare the output with the input, you may end up with a “compressed” file that is larger than the original. Of course, a smart program would reverse itself or stop in time to avoid that ridiculous outcome. Some programs aren’t that smart.

Assuming lossLESS compression only for a moment, let’s reduce the problem to an absurdity.

Compress a bunch of data. If it gets smaller, do it again. If this could continue indefinitely, you would eventually get the whole thing reduced to a single bit. Since any lossLESS compression allows the original to be reconstructed exactly, that would mean two or more different data strings could be reduced to the same single bit. Therefore, the premise that compression can keep recompressing the same string perpetually is wrong.

What you will be able to do is reduce the data string to the smallest possible string that contains sufficient information to exactly reconstruct the original. Anything smaller than that would not be fully reconstructable.

Another way to look at it is say you have two different strings. After maximum compression, they would still have to be at least one bit different. If they were identical, you couldn’t reconstruct both from the same data!

No, the original statement didn’t mention overhead, but overhead is inescapable. After compression you don’t actually have the data any more; you have a description of the data. In ideal cases the description is more concise than the data itself, but for any given compression algorithm there will be some type of data that cannot be described as concisely as the data itself. Since the algorithm is still committed to describing that data, even if the description is “I couldn’t compress this, so here it is verbatim” (so that the decompression algorithm can sensibly work on it), the description adds at least a tiny bit to the overall size.
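
Here is a minimal sketch of that “here it is verbatim” fallback, using Python’s standard zlib module (the function names and the one-byte flag values are just mine); the flag is exactly the kind of small but unavoidable overhead I mean:

```python
import zlib

def compress_or_store(data: bytes) -> bytes:
    """Prepend a one-byte flag: b'C' = deflated payload, b'S' = stored verbatim."""
    packed = zlib.compress(data)
    if len(packed) < len(data):
        return b"C" + packed
    return b"S" + data  # worst case: exactly one byte bigger than the input

def decompress_or_load(blob: bytes) -> bytes:
    return zlib.decompress(blob[1:]) if blob[:1] == b"C" else blob[1:]
```

Every input survives the round trip, and nothing ever grows by more than that single flag byte.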

If you listed out every possible 100 byte file and tried to compress them all, you’d notice that more of them got larger than got smaller. It’s easy to think compression always works well, though, because the files people actually care about and collect on their hard drives (text, sound waveforms, bitmap images, programs, databases, etc.) are full of easily exploitable patterns.
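
You don’t need 100-byte files to see it, either. Here’s a rough scaled-down version of that experiment (my own sketch, using Python’s zlib): feed every possible 2-byte file to the compressor and count what happens. Thanks to the format overhead, every single one of them comes out larger.

```python
import zlib

grew = shrank = 0
for a in range(256):
    for b in range(256):
        data = bytes((a, b))
        out = zlib.compress(data)
        if len(out) > len(data):
            grew += 1
        elif len(out) < len(data):
            shrank += 1

print(f"{grew} grew, {shrank} shrank, out of 65536 two-byte files")
```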

It is not difficult to understand. Suppose you want to compress a string of n bits. There are 2^n possible strings and if all of them are just as valid and possible then they are incompressible. There is no way to represent a random n-bit string with fewer than n bits. What you can do is re-map them so that those which are probable (useful, whatever) end up in the low region but that means displacing some from the low region to the high region. In the end, in order to represent all possible 2^n strings you will need 2^n bytes and if you have moved some of the higher ones down lower then you have to move some of the lower values up higher. No way around it. In the end, in order to represent all values you still need n bits. Not one more but not one less. No way around it.

That should be “in order to represent all possible 2^n strings you will need n bits”.

The counting argument is even simpler: there are 2^n bit strings of length n, and 2^n - 1 bit strings of length less than n. If the compressed output is to be unique, not every string can be made shorter.
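
If you want the pigeonhole squeeze as arithmetic, it’s a couple of lines (a trivial sketch, with n chosen arbitrarily):

```python
n = 100
shorter = sum(2**k for k in range(n))  # count of all bit strings of length 0 .. n-1
print(shorter, 2**n)                   # 2**n - 1 versus 2**n: always one short
```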

Well, that’s pretty much what I was saying, but the corollary is that for every string you map to a shorter one, you are mapping some shorter string to a longer one in the same proportion.

At the end of the day, with a length of n bits you can represent 2^n different strings. If they are all equally probable (random), then no matter what compression algorithm you use, it cannot compress that data globally, because some strings will be shortened but others will be lengthened. After transmitting a large number (>n) of random strings of length n you cannot achieve any compression, no matter what you do. For every string you shortened you had to make another one longer, and overall you have achieved no compression.

Compression relies on knowing something about the data. In other words, the data does not carry as much information as it might appear at first. Therefore, compression algorithms depend heavily on the type of data and there is no such thing as a universal algorithm which can compress all and any type of data. In a way the decompression algorithm carries information as well as the data it decompresses. It knows something which, if it didn’t know it could not decompress. There is no such thing as abstract compression of random data. There are only specific methods of compression for specific types of data.
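
A toy way to see “the decompressor carries information too”: suppose both ends share a hard-coded word list (the list and function names below are entirely made up, and the scheme assumes the escape byte never occurs in the input). The list never travels with the data, yet without it the receiver could reconstruct nothing:

```python
# Hypothetical dictionary baked into both the encoder and the decoder.
WORDS = ["compression ", "information ", "algorithm ", "the "]

def encode(text: str) -> str:
    # Toy only: assumes the input never contains the \x01 escape byte.
    for i, word in enumerate(WORDS):
        text = text.replace(word, "\x01" + str(i))
    return text

def decode(text: str) -> str:
    for i, word in enumerate(WORDS):
        text = text.replace("\x01" + str(i), word)
    return text
```

It does nicely on English prose that happens to use those words, and does nothing at all for random bytes, which is the point.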

We are talking of non-lossy compression, obviously. Lossy compression is a different kettle of fish because it begins by losing information which cannot be recovered later.

I just tried a quick little experiment, giving an 8 MB file of random bytes to Gzip for it to chew on. The “compressed” file ends up over a thousand bytes larger than the original — which isn’t bad if you consider it as a percentage, but is still a lot worse than I would have expected. Some of that overhead I know consists of: a 3-byte header identifying the file format and the method of compression, including “none” as a possibility; a 4-byte checksum in the trailer; the original file’s name and timestamps. Perhaps the Unix user and group names are also in there, and other attributes for reconstructing the original file. Still, the overhead seems excessive to me.

The same experiment with Zip gave similar results: about a kilobyte of overhead. With Bzip2, however, it’s much worse: about 25 KB of overhead! I have no idea what Bzip2 is putting into all that packaging. To be fair, under normal circumstances Bzip2 usually performs much better than its rivals.
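
For anyone who wants to repeat the experiment, here’s roughly what I did, redone as a sketch with Python’s standard gzip and bz2 modules; the exact overhead will differ from the command-line tools, since those add things like file names and timestamps that these library calls don’t.

```python
import bz2, gzip, os

data = os.urandom(8 * 1024 * 1024)  # 8 MB of random bytes: essentially incompressible

for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    out = compress(data)
    print(f"{name}: {len(out) - len(data):+d} bytes relative to the original")
```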

These programs are all based on lossless compression algorithms. However, there’s something I misstated in my previous post about lossy algorithms, which might be more relevant to the OP. (Or not. Relevancy isn’t my strong suit.)

It is possible for a lossy compression algorithm to guarantee a smaller output for every input. It’s lossy after all, and therefore has a license to abbreviate or approximate the input data. I doubt whether specific algorithms like MP3 or JPEG actually make such guarantees, but it’s at least mathematically possible for them to do so. My guess is that if you feed a “white noise” or very small audio/video stream to these compressors, they’ll fail. But I’ll leave that experiment to someone else.
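
As a deliberately silly illustration of that guarantee (this is not how MP3 or JPEG actually work, just the principle), a “compressor” that throws away every other sample always produces a smaller output for any input of two or more samples:

```python
def lossy_halve(samples: bytes) -> bytes:
    # Keep every other sample: always at most half the size for inputs of
    # length two or more, but the discarded samples are gone for good.
    return samples[::2]
```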

By the way, I googled up a page or two on ZeoSync’s 15 minutes of fame. Most of the branching links are broken, but it’s still an entertaining read after all this time.

This is true but misleading. Lossy compression is not only compression but involves loss of information so. . . yeah, you can make anything smaller by cutting chunks of it off. I wouldn’t call that compression though.

BTW, I am amused that someone who goes by the handle rsa is not an expert in information theory. Maybe the handle is not a reference to rsa?

Consider one of the most primitive compression methods: Run Length Encoding (which worked well for 16-bit per pixel bitmaps with large areas of solid colour). Suppose your data consists of:

aaaaaaaabbbbrrrrrrrraaacccccaaaaaadddddaaaabbbrrrrrraaaa
(56 bytes)

In its simplest implementation, RLE can compress this down to:

8a4b8r3a5c6a5d4a3b6r4a
(22 bytes)

but if your data consists of:

abracadabra
(11 bytes)

RLE will express it as:
1a1b1r1a1c1a1d1a1b1r1a
(22 bytes)

If you implement a scheme that skips over uncompressible sequences, you need to add a control header to denote whether the segment has been compressed or not, so the first sequence then becomes:
Y8a4b8r3a5c6a5d4a3b6r4a
(23 bytes)

and the second sequence becomes:
abracadabra
N(12 bytes) - better than before, but still bigger than the original data.

There are a lot of different compression methodologies, many of which are much more complex than RLE, but they all do essentially the same thing - they describe the data and there will always be a set of data which cannot be described in less space than the original.

Oops, that second from last section should read:

and the second sequence becomes:
Nabracadabra
(12 bytes) - better than before, but still bigger than the original data.
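
For the curious, here’s a rough Python sketch of the simplified scheme above, single-digit run counts and the Y/N control prefix included (a real RLE codec would use binary counts, but the sizes come out the same as in my examples):

```python
def rle_compress(data: str) -> str:
    # Build single-digit count/character pairs, as in the examples above.
    pairs = []
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 9:
            run += 1
        pairs.append(f"{run}{data[i]}")
        i += run
    encoded = "".join(pairs)
    # Control prefix: 'Y' = run-length pairs follow, 'N' = stored verbatim.
    return "Y" + encoded if len(encoded) < len(data) else "N" + data

for s in ("aaaaaaaabbbbrrrrrrrraaacccccaaaaaadddddaaaabbbrrrrrraaaa",
          "abracadabra"):
    out = rle_compress(s)
    print(f"{len(s):2d} -> {len(out):2d}  {out}")
```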

I can’t remember the exact details, but I came up with a text file that would triple in size when you ran a Huffman encoding on it. I do remember that there was one occurrence of the first character and that each subsequent character appeared twice as often as the one before it, but I don’t remember how many different characters it had: either 26 or 36.

The interesting bit was that it could be rearranged so that it would compress down to almost nothing under RLE.

It comes down to this: with n bits you can encode 2^n different combinations. Therefore it takes n·2^n bits to encode all possible combinations, and it is impossible to find any algorithm which can encode all of them in anything less than that original n·2^n bits. Shortening some comes at the expense of lengthening others. No free lunch. What compression does is shorten those which are useful (and used) at the expense of those which are not useful (and probably not used). Since what is useful depends on the type of information, the algorithms are dependent on the type of information. If all input strings were equally useful/probable then compression would be useless.

Just a question, only tangentially related to the question at hand (but I don’t think it warrants another thread): is it possible to design a circuit on a ternary basis, so that rather than having two states per bit (on and off, i.e. 1 and 0), you have three states (on, half, off)?

Maybe.

The problem you run into is that circuits run from 0-5 volts, and you need enough buffer between the different values that there are no ambiguous signals. I want to say that 0 is 0-1.2 volts, and 1 is 3.8-5 volts, but I don’t remember.

You can also run into problems extending the standard boolean operations to three values. What’s ~(1/2)?

Well, yes, it is possible. The question is whether it serves any useful purpose.

In fact, as a time-killing exercise I recently designed a simple circuit which does that, more or less. It is an alarm loop which relies on the loop having a certain resistance, so it will detect three conditions:

  1. High voltage = open loop = alarm
  2. Mid voltage = resistance in line = everything OK
  3. Low voltage = short circuit = alarm

Systems which have more than two states (3, 4, etc.) have been used for memory systems because you can encode more than one bit per sample, but they are not useful for computing because binary logic is just simpler to implement.

This is my tangential tie-in with the OP. Boolean operators would be useless on a three-state circuit, and offhand I can’t think of any possible ternary logic that would be as useful as your standard set of and/or/not/xor/etc. But something like data encoding seems great: in 8-bit binary, you get 256 possible combinations, while with eight ternary digits you’d get 6561. The question is, though, how difficult is it to implement, and would this type of architecture require a significant amount of extra space?

Since you say they have been implemented in memory systems, how do they perform in the real world? What are the upsides and downsides of this kind of design?

Going back to my RLE example above (which the reader should understand is rendered in grossly simplified human-readable form), a natural question arises:

Why do we need the N at the start of the uncompressed sequence? Can’t we just make do with the Y for the compressed sequence and imply the N in cases of its absence?

You could do that, if it weren’t for the possibility that your algorithm was asked to compress the string ‘Y1p9f2h4t’. Since this wouldn’t compress at all by RLE, your program would throw it into the file verbatim (not bothering with an N), but because the initial character of the natural (uncompressed) data just happens to be the control character that tells the decompression algorithm to get busy, your data comes out the other end as ‘pfffffffffhhtttt’.
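
Here’s that failure mode in code, as a decoder for the toy scheme I sketched earlier (again just my own illustration, with a leading ‘Y’ meaning count/character pairs follow and ‘N’ meaning verbatim):

```python
def rle_decompress(block: str) -> str:
    if block.startswith("Y"):
        pairs = block[1:]
        return "".join(ch * int(n) for n, ch in zip(pairs[::2], pairs[1::2]))
    if block.startswith("N"):
        return block[1:]
    return block  # no control prefix at all: taken at face value

# Verbatim data that just happens to begin with the control character 'Y':
print(rle_decompress("Y1p9f2h4t"))   # 'pfffffffffhhtttt' -- mangled
print(rle_decompress("NY1p9f2h4t"))  # 'Y1p9f2h4t' -- the explicit N avoids the ambiguity
```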

Surely more of a problem is that with only one control code, you can’t tell where the end of the compressed section is?
eg:
aaabbbrrrracadabra would become Y3a3b4rNacadabra

Incidentally, how are you distinguishing between the control character Y and the Y in the data? (I know this is only a simplification, but what characters would a real compression tool use? Presumably it would have to be something that never appears in the files you’re compressing?)

There have been ternary (base-3) computers in the past, though the idea seems to have had its day.