How does file compression work?

Nah, you’re safe. You’d only be breaking laws if you exported pre-compiled code.
-Pete

My favorite compression trick is to make a ridiculously huge (around 500 megs or 1 gig) image that consists of nothing but white pixels. Then compress it to a 2 meg jpg. When you view it on a website, it decompresses and makes your computer grind to a halt.
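
For the curious, here’s roughly how you’d build one (a Python sketch using the Pillow library; the dimensions and file name are just for illustration). The file on disk is tiny, but whatever displays it still has to allocate the full uncompressed buffer:

[code]
# Hypothetical decompression-bomb sketch: a huge single-colour image
# compresses to almost nothing, but viewing it means allocating the
# whole uncompressed buffer.
from PIL import Image
import os

W = H = 10_000                       # 10,000 x 10,000 = 100 million pixels
img = Image.new("RGB", (W, H), "white")
img.save("bomb.jpg", quality=90)

raw_bytes = W * H * 3                # uncompressed 24-bit RGB buffer
print("on disk:     ", os.path.getsize("bomb.jpg"), "bytes")
print("decompressed:", raw_bytes, "bytes")   # roughly 300 MB in memory
[/code]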

Wait. Does that really work? I don’t know how display of compressed images works, but the image buffer shouldn’t hold any more than whatever the screen size is, right?

No, the image buffer is completely happy to try to hold an entire image.

You’ve taken the classic definition and written it backwards.

If it compresses, then it is defined to be non-random. (In this field of study.) This is such a simple statement I cannot see how it can appear to be circular.

(Despite the use of “If”, this is a definition, so non-compressible means random. But you have to resort to Kolmogorov-type definitions to state it right.)

Regarding the output of a compressor and randomness (cf. AmbushBug): it is only a limited form of pseudo-randomness, and not nearly of good enough quality to compare to one-time pads. Keep in mind that the everyday compression schemes most people are familiar with trade compressibility for time. Who wants to wait a couple of years to compress an image to 95% less space? Hence a zipped file could be compressed further (and therefore is not random), but doing so might require a method that takes substantially more time. And keep in mind that the NSA, etc. have “time” (in the sense of computer cycles).
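
You can see that trade-off in miniature with an everyday library (a quick Python sketch using zlib’s standard effort levels; the test data is just an arbitrary redundant blob):

[code]
# Same algorithm, different amounts of effort: higher zlib levels spend
# more time searching for matches and usually produce smaller output.
import time
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 50_000

for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(out):>8} bytes in {elapsed:.3f} s")
[/code]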

I can’t find any good online resources on Kolmogorov complexity, but when I get home I’ll post what little info I have, just in case anyone’s curious.

Let x be a string. Let |x| denote the length of x (i.e., the number of characters in x).

The Kolmogorov complexity of x, denoted K(x), is the length of the shortest program/input combination that produces x as output.

How does this relate to compressibility? Well, for a constant c, a string x is said to be c-compressible if K(x) ≤ |x| - c. If x is not 1-compressible, then x is incompressible. That’s what we’re calling random here.
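
K(x) itself is uncomputable, but if anyone wants something concrete to poke at, the length of a zlib-compressed copy of x gives a crude upper bound on it (a Python sketch, nothing more):

[code]
# Rough illustration only: zlib.compress(x) is one particular
# "program/input combination" that reproduces x, so its length is an
# upper bound on K(x) (up to the constant-size decompressor).
import os
import zlib

def compressed_len(x: bytes) -> int:
    return len(zlib.compress(x, 9))

structured = b"ab" * 50_000        # obvious regularity: a tiny program reproduces it
random_ish = os.urandom(100_000)   # no structure a compressor can find

print(len(structured), "->", compressed_len(structured))  # shrinks enormously
print(len(random_ish), "->", compressed_len(random_ish))  # barely changes, may even grow
[/code]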

You can get to a version of their original web page at
http://web.archive.org/web/20010719111638/http://www.zeosync.com/

I had another look at it and laughed at the crap almost as hard as the first time I saw it. I remember news reports giving it credibility, which just goes to show you what reporters know.

Yeah, in particular the main claim can’t exactly be found guilty of understatement.

Hey, they didn’t say how big the disk would have to be, now did they?

When I was a computer science student in college, one of our assignments was to write a text file compressor. The method I used was pretty straightforward. Normally every letter and punctuation mark is represented by a byte of memory. I simply did frequency analysis on the file, then assigned the commonest characters shorter bit strings. The most common letter would be represented by two or three bits, the least common ones by maybe as many as ten. Do this conversion to all the characters in the file and append the conversion dictionary to the end. Abracadabra, the file is smaller.
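
Reconstructed from memory, it looked something like this (a rough Python sketch, not the code I actually handed in; packing the bits and writing out the dictionary are left out):

[code]
# Frequency analysis plus variable-length codes: common characters get
# short bit strings, rare ones get long ones (i.e., a Huffman code).
import heapq
from collections import Counter

def build_codes(text):
    # Heap entries: (frequency, tiebreaker, {char: code-so-far})
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "it was the best of times, it was the worst of times"
codes = build_codes(text)
coded_bits = sum(len(codes[ch]) for ch in text)
print(f"{8 * len(text)} bits as plain bytes vs {coded_bits} bits coded")
[/code]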

What you’re talking about is a Huffman encoding, and I think I could come up with a text file that would get larger under Huffman encoding.

You certainly could, but that text file would probably not be well-formed English. For instance, the string “abcdefghijklmnopqrstuvwxyz” would not be compressed by Huffman, but the Gettysburg Address would be.

Also, an optimally compressed file is generally not indistinguishable from random noise, depending on what you mean by “optimal”. A zip file, for instance, includes some extra redundancy for the purpose of error detection. This only adds a logarithmic number of bits to the file (i.e., the number of added bits is on the order of the log of the number of original bits), so it doesn’t hurt your compression much, but it makes the message instantly distinguishable from noise (noise doesn’t usually come with a bunch of convenient checksums at the end).
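
A quick way to see that, using Python’s gzip module as a stand-in for zip (the container formats differ, but both carry headers and a CRC):

[code]
# Compressed output announces itself: fixed magic bytes up front and a
# CRC-32 of the original data in the trailer.  Random noise has neither.
import gzip
import os
import zlib

packed = gzip.compress(b"Four score and seven years ago " * 1_000)
noise = os.urandom(len(packed))

for name, data in (("gzip output ", packed), ("random noise", noise)):
    print(name, "starts with gzip magic bytes:", data[:2] == b"\x1f\x8b")

# Last 8 bytes of a gzip stream: CRC-32 then original size, little-endian.
stored_crc = int.from_bytes(packed[-8:-4], "little")
print("stored CRC matches the data:",
      stored_crc == zlib.crc32(gzip.decompress(packed)))
[/code]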

Yeah. So?

I realize that forgottenlore wasn’t claiming that I couldn’t do that, but I wanted to make sure everyone realized it was possible.

I basically got romped in GQ for some of the absurd statements I made in the last thread I remember on compression, so let me say that ignorance was fought, even if it was a bitter fight. But, I still have some questions.

For a given compression routine, can it be known ahead of time exactly how many combinations will fail to compress (out of all strings of a given length or less)?

Exactly? No. It would take some time to explain, complete with a detour into the Busy Beaver Function and the Amazing Omega Constant.

It could be figured out roughly, as long as you’re happy with approximate answers. For typical compressors, the easy answer is “almost all of them won’t compress”: only a tiny fraction of strings compress at all, and that fraction keeps shrinking as the string length increases. E.g., something like 2[sup]n/c[/sup] compressible strings out of 2[sup]n[/sup] for some constant c>1. That may look like a lot but it’s an insignificant percentage for large n.
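
If you want to see the flavor of it without any theory (a Python sketch; zlib’s fixed header overhead exaggerates things a bit at short lengths, but the counting argument holds for any lossless scheme):

[code]
# Compress a pile of random strings and count how many actually shrink.
import os
import zlib

N, LENGTH = 10_000, 256
shrank = sum(
    len(zlib.compress(os.urandom(LENGTH), 9)) < LENGTH
    for _ in range(N)
)
print(f"{shrank} of {N} random {LENGTH}-byte strings got smaller")  # expect 0
[/code]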

[symbol]w[/symbol] is the probability that a randomly chosen program will halt upon a randomly chosen input, correct?

A related question I’ve wondered about: Is it possible to replace JPEG, PKZIP, GIF, etc.–all of them–with some new theoretical lossless technique that achieves maximum possible compression on all sets of data, or will it always be that certain techniques work better with different sets?

No compression technique can achieve the maximum possible compression for every file. See [73] in the second part of the compression FAQ I referenced earlier for an explanation of why not.