Compression Algorithms--I'm clearly missing something

However, given the correct input, and with a compression algorithm with enough special cases built in, you could probably compress a file this way, provided it doesn’t contain, say, too many different 6-character strings.

Of course, this wouldn’t be very good, because it would be very difficult to find a file it works on; but in theory you could make a compression algorithm that works very, very well, but only on a very, very small number of inputs.

Well, it’s easy to make a compression algorithm that works very, very well on a very, very small number of inputs. E.g., I can pick four 10GB files of my choosing and compress them down to 2 bits each. And in counterbalance, other files will go up in size. On average, nothing will be saved.
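
To make that concrete, here’s a rough Python sketch (the file contents below are placeholders I made up, standing in for the 10GB blobs): the four favored files come out a few bits long, and every other file pays for it by growing one bit, because the decompressor needs a flag to tell the two cases apart.

    # Hypothetical sketch: four chosen files (stand-ins below) get a tiny code,
    # and every other file grows by the 1-bit flag the decompressor needs to
    # tell the two cases apart.
    SPECIAL_FILES = [b"blob0", b"blob1", b"blob2", b"blob3"]  # placeholders

    def compress(data: bytes) -> str:
        """Return a bit string: 3 bits for a special file, original + 1 bit otherwise."""
        if data in SPECIAL_FILES:
            return "1" + format(SPECIAL_FILES.index(data), "02b")   # flag + 2-bit index
        return "0" + "".join(format(byte, "08b") for byte in data)  # flag + original bits

    def decompress(bits: str) -> bytes:
        if bits[0] == "1":
            return SPECIAL_FILES[int(bits[1:3], 2)]
        payload = bits[1:]
        return bytes(int(payload[i:i + 8], 2) for i in range(0, len(payload), 8))

    # Round-trips losslessly; the "savings" on the special files are paid for
    # by the extra flag bit on everything else.
    assert all(decompress(compress(f)) == f for f in SPECIAL_FILES + [b"any other file"])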

Lossless compression is only possible to the extent that it compresses files on weighted average, weighted by how common those files are. And you can only compress files on weighted average to the extent that their frequencies are not in tune with their coding; that is, to the extent that they were originally stored under some memory-inefficient coding such that some shorter files are less common than some longer files.
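
To put a number on “not in tune with their coding” (symbols and probabilities invented purely for illustration), here’s a quick Python calculation: under a skewed distribution, a code matched to the frequencies beats the flat 2-bit code on weighted average, while under a uniform distribution it does worse.

    # Weighted-average code length for the same four symbols under two codings.
    # The probabilities are invented for illustration.
    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    flat_code   = {"a": "00", "b": "01", "c": "10",  "d": "11"}   # 2 bits each
    skewed_code = {"a": "0",  "b": "10", "c": "110", "d": "111"}  # prefix-free

    def average_bits(code):
        return sum(probs[s] * len(code[s]) for s in probs)

    print(average_bits(flat_code))    # 2.0 bits per symbol
    print(average_bits(skewed_code))  # 1.75 bits per symbol
    # With uniform probabilities (all 0.25) the skewed code would average
    # 2.25 bits, i.e. worse than the flat code: no mismatch, no saving.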

I never said it was useful, just that what he was saying was technically doable so long as you keep control of the space of valid inputs.

You’ll likely need a 40GB decompression utility in order to use them again. The combined size of the decompression utility and the 8 bits of compressed data will end up being larger than the original files.

I think you are confusing search trees (either binary trees or balanced “B-Trees”) with data compression. Search trees tend to compress time rather than data, by requiring less effort to locate a data record.

But Huffman encoding uses a binary tree, too. I thought that’s what he was remembering.

I’m looking for investors. I have perfected a compression algorithm that reduces any file, of any size, to ten bytes.

Still working on the decompression part.

Yeah, on further reading that’s what I remember. The main difference between a Huffman tree and a BST seems to be that a Huffman tree only stores values at its leaf nodes, for ease-of-parsing reasons (that’s what keeps the codes unambiguous to decode).
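
For what it’s worth, here’s a bare-bones Huffman construction in Python (the frequencies are made up, roughly the textbook example): symbols only ever land on leaf nodes, which is what keeps any one code from being a prefix of another.

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        """Build a Huffman tree and return {symbol: bit string}.
        Internal nodes carry no symbol; only leaves do, so no code is a
        prefix of any other."""
        tiebreak = count()  # keeps heap entries comparable when weights tie
        heap = [(weight, next(tiebreak), (symbol, None, None))
                for symbol, weight in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tiebreak), (None, left, right)))
        _, _, root = heap[0]

        codes = {}
        def walk(node, path):
            symbol, left, right = node
            if symbol is not None:         # leaf: record the accumulated bits
                codes[symbol] = path or "0"
                return
            walk(left, path + "0")         # internal node: just keep descending
            walk(right, path + "1")
        walk(root, "")
        return codes

    print(huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))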

What a coincidence, I have a decompression utility that will decompress any 10-byte file into a picture of a kitten in a teacup.

Ah, alright.

Sure. Though if you let me design the machine architecture, I can get the decompression utility to be a built-in machine code of minimal length as well…

Indeed, every machine architecture defines its own notion of ultimate compression, “The smallest (or any) program that outputs this file”. But this measure will be quite architecture-dependent.

There’s actually malware that is written almost entirely as meta-code: a small code snippet that does some often crazy stuff to write the actual malicious part on the fly. This is an example, albeit an incredibly rare and difficult-to-create one, of a file “compression” that produces exactly one bit-string as output.

This is the absolute nub of compression theory, and it’s the bit people often either overlook or can’t grasp because it’s wrapped up in more complex terminology. There just aren’t enough possible combinations in a small space to describe every possible combination of a larger space.
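
Concretely, and this is just my own back-of-the-envelope count: there are 2^n bit strings of length exactly n, but only 2^n - 1 bit strings of all shorter lengths put together, so at least one n-bit file always has nowhere shorter to go.

    # How many bit strings exist at each length, versus all shorter lengths combined.
    for n in range(1, 9):
        exactly_n = 2 ** n
        shorter = 2 ** n - 1        # 2^0 + 2^1 + ... + 2^(n-1)
        print(f"length {n}: {exactly_n} strings, only {shorter} shorter ones to map them to")
    # Every row comes up one short, so some n-bit file has no shorter code available.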

Ah-HA!!

Thank you for that! I get it now in a way I didn’t before. Really great answer.

That’s nothing, Indistinguishable. I have a lossless compression algorithm that can take any file at all as input, and is guaranteed to never increase the file size, and which can be iterated such that it’ll eventually get any given input file down to an arbitrarily small size.

Kid, I like your idea of storing the file info in the number of times it’s been compressed. It shows spunk. Dammit, it’s just crazy enough to –

But, wait, what do you do if the input is already as short as possible?

If compression is reversible without increasing file size, it must be permuting the shortest possible files. Which means none are left available to send any other files to. Which means you’ll never be able to get any other files down to that size, no matter how many times you compress.

Alas, even this slight claim is still too crazy to work.

Well, then, obviously, you compress a 1-bit file into a 0-bit file.

EDIT: I knew I’d mentioned that scheme on the SDMB before, but I’d forgotten that the previous time was also in a discussion with you. I might have guessed, though.

But what do you compress the 0-bit file into? You can’t send it to itself if someone else (e.g., that 1-bit file) is being sent to it. But then its file size must increase…

That is, there’s a fundamental constraint on injective (i.e., lossless) maps: if no files go up in size, then no files go down in size either. The only way to move some files lower is by displacing some lower files higher.
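
That claim is small enough to check by brute force. Here’s a quick Python check (my own, purely illustrative) over all bit strings of length at most 2: every injective map that never increases length turns out never to decrease length either.

    from itertools import permutations, product

    # All bit strings of length 0, 1, or 2 (seven strings in total).
    strings = [""] + ["".join(bits) for n in (1, 2) for bits in product("01", repeat=n)]

    # A map that never increases length sends this set into itself, and an
    # injective self-map of a finite set is a permutation, so checking every
    # permutation covers every possible lossless, never-growing compressor.
    shrinks_something = False
    for images in permutations(strings):
        mapping = dict(zip(strings, images))
        if any(len(mapping[s]) > len(s) for s in strings):
            continue                     # this map grows some file: not a counterexample
        if any(len(mapping[s]) < len(s) for s in strings):
            shrinks_something = True     # would shrink a file without growing any other
    print(shrinks_something)             # False: no such map exists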