Non-lossy compression implications

It’s trivially obvious that there is no such thing as an always effective lossless compression algorithm, and moreover that the average compression of random data will be zero.
My question is: what are the implications?

Not sure what you mean by “implications”, but there’s some math in the wikipedia article here.

That is true.

That isn’t true. Just because a number is random doesn’t mean that it has no patterns or repeated sequences that can be compressed. Random just means that any patterns or repeat sequences can’t be predicted in advance. Once we have the data we can compress those repeat sequences and patterns readily.

For example the results of flipping a coin are truly random and might appear as

HTHHHTHTTTTHTHHHTHTHHHTHTTTTHH

If we allow that

!n = n is repeated 3 times

and

@n = n is repeated 4 times

then I can compress that 30-character string into the 21-character string

HT!HTH@THT!HT!HTH@THH

That’s a 30% compression of random data using the simplest compression algorithm possible.
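
Here is a minimal sketch of that run-length idea in Python, purely as a toy illustration (the encode/decode functions are my own, not any standard scheme); depending on how the runs are segmented, the compressed string can come out a character or two longer than the 21 characters quoted above, but the round trip is lossless:

```python
import itertools

def encode(s):
    # Runs of exactly 3 become "!x", runs of exactly 4 become "@x",
    # everything else is left as-is.
    out = []
    for ch, group in itertools.groupby(s):
        run = len(list(group))
        if run == 3:
            out.append("!" + ch)
        elif run == 4:
            out.append("@" + ch)
        else:
            out.append(ch * run)
    return "".join(out)

def decode(s):
    out = []
    i = 0
    while i < len(s):
        if s[i] == "!":
            out.append(s[i + 1] * 3)
            i += 2
        elif s[i] == "@":
            out.append(s[i + 1] * 4)
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

original = "HTHHHTHTTTTHTHHHTHTHHHTHTTTTHH"
packed = encode(original)
assert decode(packed) == original        # lossless round trip
print(len(original), len(packed))        # 30 23 with this segmentation
```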

Because random data is random, it would be improbable that there were never any repeats and never any patterns that could be readily compressed. Exactly what the compression ratio would be is entirely dependent on the type of random data, but on average it will always be greater than zero.

Implications for what? Space flight? Genetic engineering? The baked good industry?

Or worse, of course.

As for practical implications, not too much. Most of the files that you’re ever likely to encounter are sufficiently non-random that they can be compressed very well, even losslessly. The only files you’re likely to encounter which don’t compress well are those which are already compressed (with or without loss), and even there, in the worst case scenario, the resulting “compressed” file will be only a little bit bigger than the original: The compression program will append a header to the effect that “this file is already as compressed as it can get, and there’s nothing more I can do to it”, and just leave the original as-is.
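
A quick way to see this in practice is to run something highly repetitive and something from a random source through an off-the-shelf compressor; here is a sketch using Python’s zlib (just for illustration, any general-purpose compressor behaves similarly):

```python
import os
import zlib

regular = b"HT" * 50_000                 # very repetitive, compresses well
random_bytes = os.urandom(100_000)       # effectively incompressible

print(len(zlib.compress(regular)))       # a few hundred bytes
print(len(zlib.compress(random_bytes)))  # slightly larger than the 100_000-byte input
```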

Except that in your example you need two more characters to express the sequence. If you’re storing the sequence as ASCII then you’ve compressed the data, but if you stored it as binary you could store H as 1 and T as 0, whereas in ASCII you need 7 binary digits per character.

So it might be more efficient to have more “letters”, where each letter represents a particular string; but it could also be more efficient to have as few letters as possible, each represented by as few bits as possible.

Yes, but for any given compression scheme, the average compression is zero. Consider the set of uncompressed strings and the set of compressed strings: a lossless scheme forms a bijection between them. So for every string that is compressed by a given algorithm, there is necessarily one that is made longer.
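
The counting behind that argument can be spelled out directly; a short sketch of just the pigeonhole arithmetic (nothing here is specific to any particular algorithm):

```python
# Strings of length n cannot all map reversibly to strictly shorter
# strings, because there simply aren't enough shorter strings.
for n in range(1, 11):
    length_n = 2 ** n                                # strings of exactly n bits
    shorter = sum(2 ** k for k in range(n))          # all strings of length 0..n-1, i.e. 2**n - 1
    print(n, length_n, shorter, length_n > shorter)  # always True
```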

Anything!

Blake, while your example shows that some sets of random data can be compressed, it does not disprove the claim you quoted, which is that the average compression, using any given algorithm, of random data, will be 0% or worse.

As Lemur866 pointed out, your compression scheme actually takes up more space even on the input you gave, since it requires more symbols.

To answer the OP: mostly, the implications are that you need different compression algorithms for different purposes. Luckily, most of the data we want to store isn’t random.

Edited to add: you introduced new symbols in your examples above, which is not fair - we need the same alphabet for the compressed and uncompressed strings.

By definition, a number is random iff its binary representation can’t be compressed by 2 or more bits. The study of this sort of thing is known as algorithmic information theory, or Kolmogorov complexity*, and there’s a decent introduction over at Wikipedia.
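
For reference, the usual textbook form of that definition (standard notation, not a quote from the article) is:

```latex
% Kolmogorov complexity of x relative to a fixed universal machine U,
% and the corresponding notion of an incompressible ("random") string:
K_U(x) = \min \{\, |p| : U(p) = x \,\}, \qquad
x \text{ is } c\text{-incompressible} \iff K_U(x) \ge |x| - c
```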

The implications for everyday life aren’t too deep, but the philosophical implications are. One of the big theorists, Gregory Chaitin, has made a very strong case that randomness does exist, at least so far as arithmetic goes.

*Strictly speaking Kolmogorov complexity is a subfield of algorithmic information theory, but it seems that the two are equated in common usage.

It’s all encoded in 0s and 1s anyway, so there’s really no issue with adding as many symbols as you like.

Sure, but that would negate his point.

Actually, is this true? I understand the average compression cannot ever be more than zero, but I think it could be less - if it simply isn’t a very good scheme.

Good point. In fact, any compression scheme where ANY string of bits is not a valid input for decompression presumably has an average compression of less than zero!

You got compression there just because you were extending the alphabet. But now try the same thing in binary: You’d have to use some set of bits to encode ! and @, and then you’d need some way to distinguish between using that set of bits to encode one of your control characters, from using it to represent itself. For instance, one might use “11” to encode “H”, “00” to encode “T”, “10” to encode “!”, and “01” to encode “@”. But then your 21-character “compressed” string inflates to 42 bits (two per character), compared to the 30 bits needed for the original (1 per character). Cryptoderk was correct; no compression algorithm can ever average better than no compression, given random data.
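
Counting it out explicitly, using the bit assignments from the post above (a quick sketch, nothing more):

```python
# With a four-symbol alphabet (H, T, !, @) you need at least 2 bits per
# symbol, so the 21-symbol "compressed" string costs more bits than the
# 30-symbol original, which only needs 1 bit per coin flip.
codes = {"H": "11", "T": "00", "!": "10", "@": "01"}
original = "HTHHHTHTTTTHTHHHTHTHHHTHTTTTHH"
packed = "HT!HTH@THT!HT!HTH@THH"
print(len(original) * 1)                        # 30 bits at 1 bit per flip
print(sum(len(codes[c]) for c in packed))       # 42 bits at 2 bits per symbol
```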

I’m still not getting this. If we have a long enough string of random data then we will inevitably get duplicated sequences and arithmetic patterns that are amenable to compression. The idea that “you’d need some way to distinguish between using that set of bits to encode one of your control characters, from using it to represent itself” seems like a red herring. It’s quite true, but it seems like it would be as simple as having a string somewhere that says “compressed data starts here”. Yes, that string takes up space itself, but for a sufficiently large block of data it becomes an insignificant part of the compression.

To give some idea what I mean: it’s well established that pi is irrational, which to me means that the digit sequence is random. I lifted this 832-bit sequence of pi off a webpage:

0011001100101110001100010011010000110001001101010011100100110010001101100011010100110011001101010011100000111001001101110011100100110011001100100011001100111000001101000011011000110010001101100011010000110011001100110011100000110011001100100011011100111001001101010011000000110010001110000011100000110100001100010011100100110111001100010011011000111001001100110011100100111001001100110011011100110101001100010011000000110101001110000011001000110000001110010011011100110100001110010011010000110100001101010011100100110010001100110011000000110111001110000011000100110110001101000011000000110110001100100011100000110110001100100011000000111000001110010011100100111000001101100011001000111000001100000011001100110100001110000011001000110101001100110011010000110010001100010011000100110111001100000011011000110111001110010010000000100000


By simply taking some 20 digit repeating sequences and replacing them with a letter I can convert that sequence into

0011001100101110001100010011010000110001a001000110110001101010011001100110101001110000011100100110111b001000110011001110000011010000110110001100100011011000110100001100110011001100111000001100110011001000110111001110010011010100110000001100100011100000111000001101000011000100111001001101110011000100110110b1001b011100110101001100010011000000110101001110000011001000110000001110010011011100110100001110010011010000110100a001000110011001100000011011100111000001100010011011000110100001100000011011000110010c01000110000001110000011100100111001c01000111000001100000011001100110100001110000011001000110101001100110011010000110010001100010011000100110111001100000011011000110111001110010010000000100000

690 binary digits + 7 ASCII characters (at 7 bits each) = 739 binary digits

Even if we want to include the decryption “code”:
00110101001110010011 = a
00111001001100110011 = b
001110000011011000110 = c

that still only adds around 85 binary digits, for a reduction from 832 down to 824 binary digits.

And that compression didn’t make any attempt to identify all repeating sequences; I simply selected the first 3 that a very simple macro could isolate. I’m sure that given more complex data analysis I could find literally a dozen more sequences in there that repeat and that are over the 10-binary-digit size needed to make compression effective.

This is why I can’t understand this idea that random data can’t be compressed. For sufficiently long strings of random data it seems like compression isn’t a problem, simply because some sequences will inevitably be repeated, and those repeats can be compressed. And as the string gets longer and longer, sequences will be repeated more and more often, making even a clumsy compression attempt such as the one here more and more efficient. If we spun pi out for a million digits instead of 832 we can be sure that the “a” sequence would repeat more and more often, making compression ever greater.
Even allowing that headers and so forth take up space, at some point the data has to be compressible, simply because at some point any random string will start to occur more than once; it can’t do otherwise. Once it repeats sufficiently to compensate for the size of the headers and so forth, then the data is compressed.
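
That premise does check out, for what it’s worth: short windows really do repeat in long random strings. Here is a quick toy check (my own, not part of anyone’s scheme) counting how many distinct 20-bit windows occur more than once in 100,000 random bits; the sticking point, as the replies below argue, is the cost of pointing at those repeats, not their existence:

```python
import random
from collections import Counter

random.seed(0)
bits = "".join(random.choice("01") for _ in range(100_000))

# Count every overlapping 20-bit window and see how many recur.
windows = Counter(bits[i:i + 20] for i in range(len(bits) - 19))
repeated = sum(1 for count in windows.values() if count > 1)
print(repeated)   # typically thousands of 20-bit windows appear more than once
```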

I’m sure there’s something really obvious I’m missing here, but I can’t quite see what. It seems to follow inevitably from the following facts:

Random data are, well, random sequences.

Any random sequence must repeat parts of its own sequence given sufficient time, simply because there are a finite number of permutations for any size sequence. For instance, there are only a finite number of ways that you can arrange 20 binary digits, and once we reach a certain string length one of those permutations must repeat.

Any sufficiently large part of a sequence that repeats can be represented by a code that is smaller than the sequence itself. I’m not sure if this critical size is 10 binary digits or 10,000, but at some point it must become efficient to represent a repeated sequence as a code rather than as the sequence itself.

Any sequence that is represented by a code that is smaller than itself has been compressed.

Ergo at some size random data has to become compressible.

Even if we need to be able to represent a million places of pi by a single code to achieve effective compression, we know that at some point that sequence of a million digits will re-occur, and as soon as it does we have achieved compression of random data. We might need to calculate pi to 10^1,000,000 places to see any million-digit sequence repeated, but at that size we will be able to compress pi.

Can someone point out what obvious point I’m missing here? I can understand that you can’t compress random data as it is being generated, but once it’s generated it seems that at some point there must be sufficient repeats to allow compression…

As previously pointed out, try mapping the uncompressed to the compressed sets.

Consider all strings up to an arbitrary length k. Then:

Each string in the uncompressed set maps to a string in the compressed set. If any of these compressed elements is shorter than its uncompressed element, then one of the compressed elements has to map to an element that is longer than its uncompressed element. There is no other way it can fit when you consider that the mapping is reversible.
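
A toy version of that argument, with an arbitrary shuffle standing in for “any reversible scheme” on the strings of length up to 3 (a sketch only):

```python
import itertools
import random

# All binary strings of length 0..3.
strings = ["".join(bits) for k in range(4)
           for bits in itertools.product("01", repeat=k)]

# Any reversible map from this finite set into itself is a permutation of it,
# so an arbitrary shuffle is a fair stand-in for "any" such scheme.
random.seed(1)
outputs = strings[:]
random.shuffle(outputs)
scheme = dict(zip(strings, outputs))

shrunk = sum(1 for s in strings if len(scheme[s]) < len(s))
grew = sum(1 for s in strings if len(scheme[s]) > len(s))
total_conserved = sum(map(len, strings)) == sum(map(len, outputs))
# Counts vary with the shuffle, but the total length is always conserved,
# so every string that shrinks is paid for by one that grows.
print(shrunk, grew, total_conserved)
```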

Well yeah, but so what? Provided that the compressed element maps to multiple uncompressed elements, as in my example above, it doesn’t seem to matter one bit; you still achieve compression.

It’s reversible because you repeat the same arithmetic operations multiple times. You don’t need to “fit” those operations to anything, you only need to perform them at the right point.

As your block of random data gets longer, it becomes increasingly likely that your “compressed data starts here” marker will show up in the data itself. That’ll cause your algorithm to fail.
If you try to get around that by increasing the number of bits per data element to the point where you can find a marker that isn’t already in the uncompressed data, you also increase the apparent randomness of your representation of the data. That results in poorer compression.
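
To put a rough number on that, here is a quick sketch with an arbitrary 9-bit marker chosen purely for illustration:

```python
import random

# Any fixed k-bit marker shows up by chance roughly once every 2**k bits,
# so in a long random file a "compressed data starts here" flag keeps
# colliding with ordinary data unless you spend extra bits escaping it.
random.seed(0)
data = "".join(random.choice("01") for _ in range(100_000))
marker = "111111111"                       # an arbitrary 9-bit flag
hits = sum(1 for i in range(len(data) - len(marker) + 1)
           if data[i:i + len(marker)] == marker)
print(hits)                                # on the order of 100_000 / 2**9, i.e. roughly 195
```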

The problem is that having headers and shortening codes adds a lot of overhead into a random file that you never recover, e.g. for your scheme:

  1. How do you differentiate your codes “a” etc. from your data in the file?
     Answer: have a header, e.g. “111111111”, that says the data after this is a letter, not the sequence of pi.
  2. But then what do you do when pi comes up with the header sequence naturally - how do you tell the difference?

What you need to do is write out the binary for your abbreviated pi (with headers and abbreviations etc) and show us how a programme can decipher it unambiguously and that it is shorter.

Then I will patent it and make a billion dollars :slight_smile:
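
One way to put a number on the “overhead you never recover” point, using the million-digit example from a few posts up (just back-of-the-envelope arithmetic, not anyone’s actual scheme): to say “repeat the million digits that appeared starting at position p” you have to write p down, and p can be anywhere up to about 10^1,000,000, so the pointer costs roughly as many bits as the digits it replaces.

```python
import math

match_digits = 1_000_000                             # digits in the repeated block
places_exponent = 1_000_000                          # pi taken out to 10**places_exponent places
bits_saved = match_digits * math.log2(10)            # information content of the repeated digits
bits_for_offset = places_exponent * math.log2(10)    # log2(10**places_exponent) just to say where it was
print(round(bits_saved), round(bits_for_offset))     # essentially identical -- nothing left over
```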

Sure, you can do that, but then you have a lossy compression scheme. For comparison, here’s the ultimate compression scheme: I take any binary string as input, and compress it into the output string “1”. Sure, that one compressed element maps to multiple uncompressed elements, but it compresses everything really efficiently.

Of course, there is a middle ground. With some types of data, it’s possible to lossily throw out most of the information, while still keeping the important information. A JPEG image, for instance, can be much smaller than the uncompressed version of the image, and still look almost the same to the human eye. But there’s still information lost, there.

And pi is absolutely not random. As another current thread discusses, it’s quite possible for an irrational number to have a pattern to it. I could write a computer program which is only a few kilobytes long, but which would print out pi to a googol decimal places (if you gave it enough time). That’s a huge amount of compression.
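
For the curious, that isn’t hypothetical: Gibbons’ unbounded spigot algorithm does exactly this in a handful of lines. A sketch in Python (given enough time and memory it streams out as many decimal digits of pi as you ask for):

```python
def pi_digits():
    """Yield decimal digits of pi one at a time (Gibbons' unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is safely determined; emit it and rescale.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Consume another term of the underlying series.
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

digits = pi_digits()
print("".join(str(next(digits)) for _ in range(30)))   # 314159265358979323846264338327
```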

[boast] I remember doing something similar in the ’60s using Fortran (only to 1,000 places - we didn’t have enough computer time or memory back then) [/boast]

However, to do so I had to set up several 1,000-place arrays to store the intermediate figures. (Can it be done without them?) That would have to be counted in your overhead. To say otherwise would mean that I could just use the symbol “pi” to represent an infinite number of digits - surely the ultimate compression. However, it is not a very useful compression procedure.