Compression Algorithms--I'm clearly missing something

However, given the correct input, and with a compression algorithm with enough special cases built in, you could probably compress a file this way, provided it doesn’t contain, say, too many different 6-character strings.

Of course, this wouldn’t be very good, because it would be very difficult to find a file it works on; but in theory you could make a compression algorithm that works very, very well, but only on a very, very small number of inputs.

Well, it’s easy to make a compression algorithm that works very, very well on a very, very small number of inputs. E.g., I can pick four 10GB files of my choosing and compress them down to 2 bits each. And in counterbalance, other files will go up in size. On average, nothing will be saved.
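
To make that concrete, here’s a rough Python sketch (the file contents below are placeholders I made up, standing in for the 10GB blobs): the four favored files come out a few bits long, and every other file pays for it by growing one bit, because the decompressor needs a flag to tell the two cases apart.

    # Hypothetical sketch: four chosen files (stand-ins below) get a tiny code,
    # and every other file grows by the 1-bit flag the decompressor needs to
    # tell the two cases apart.
    SPECIAL_FILES = [b"blob0", b"blob1", b"blob2", b"blob3"]  # placeholders

    def compress(data: bytes) -> str:
        """Return a bit string: 3 bits for a special file, original + 1 bit otherwise."""
        if data in SPECIAL_FILES:
            return "1" + format(SPECIAL_FILES.index(data), "02b")   # flag + 2-bit index
        return "0" + "".join(format(byte, "08b") for byte in data)  # flag + original bits

    def decompress(bits: str) -> bytes:
        if bits[0] == "1":
            return SPECIAL_FILES[int(bits[1:3], 2)]
        payload = bits[1:]
        return bytes(int(payload[i:i + 8], 2) for i in range(0, len(payload), 8))

    # Round-trips losslessly; the "savings" on the special files are paid for
    # by the extra flag bit on everything else.
    assert all(decompress(compress(f)) == f for f in SPECIAL_FILES + [b"any other file"])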

Lossless compression is only possible to the extent that it compresses files on weighted average, weighted by how common those files are. And you can only compress files on weighted average to the extent that their frequencies are not in tune with their coding; that is, to the extent that they were originally stored under some memory-inefficient coding such that some shorter files are less common than some longer files.
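
To put a number on “not in tune with their coding” (symbols and probabilities invented purely for illustration), here’s a quick Python calculation: under a skewed distribution, a code matched to the frequencies beats the flat 2-bit code on weighted average, while under a uniform distribution it does worse.

    # Weighted-average code length for the same four symbols under two codings.
    # The probabilities are invented for illustration.
    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    flat_code   = {"a": "00", "b": "01", "c": "10",  "d": "11"}   # 2 bits each
    skewed_code = {"a": "0",  "b": "10", "c": "110", "d": "111"}  # prefix-free

    def average_bits(code):
        return sum(probs[s] * len(code[s]) for s in probs)

    print(average_bits(flat_code))    # 2.0 bits per symbol
    print(average_bits(skewed_code))  # 1.75 bits per symbol
    # With uniform probabilities (all 0.25) the skewed code would average
    # 2.25 bits, i.e. worse than the flat code: no mismatch, no saving.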

I never said it was useful, just that what he was saying was technically doable so long as you keep control of the space of valid inputs.

You’ll likely need a 40GB decompression utility in order to use them again. The combined size of the decompression utility and the 8 bits of compressed data will end up being larger than the original files.

I think you are confusing search trees (either binary trees or balanced “B-Trees”) with data compression. Search trees tend to compress time rather than data, by requiring less effort to locate a data record.

But Huffman encoding uses a binary tree, too. I thought that’s what he was remembering.

I’m looking for investors. I have perfected a compression algorithm that reduces any file, of any size, to ten bytes.

Still working on the decompression part.

Yeah, on further reading that’s what I remember. The main difference between a Huffman tree and a BST seems to be that a Huffman tree only stores values at its leaf nodes, for ease-of-parsing reasons (that’s what keeps the codes unambiguous to decode).
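
For what it’s worth, here’s a bare-bones Huffman construction in Python (the frequencies are made up, roughly the textbook example): symbols only ever land on leaf nodes, which is what keeps any one code from being a prefix of another.

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        """Build a Huffman tree and return {symbol: bit string}.
        Internal nodes carry no symbol; only leaves do, so no code is a
        prefix of any other."""
        tiebreak = count()  # keeps heap entries comparable when weights tie
        heap = [(weight, next(tiebreak), (symbol, None, None))
                for symbol, weight in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tiebreak), (None, left, right)))
        _, _, root = heap[0]

        codes = {}
        def walk(node, path):
            symbol, left, right = node
            if symbol is not None:         # leaf: record the accumulated bits
                codes[symbol] = path or "0"
                return
            walk(left, path + "0")         # internal node: just keep descending
            walk(right, path + "1")
        walk(root, "")
        return codes

    print(huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))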

What a coincidence, I have a decompression utility that will decompress any 10-byte file into a picture of a kitten in a teacup.

Ah, alright.

Sure. Though if you let me design the machine architecture, I can get the decompression utility to be a built-in machine code of minimal length as well…

Indeed, every machine architecture defines its own notion of ultimate compression, “The smallest (or any) program that outputs this file”. But this measure will be quite architecture-dependent.

There’s actually malware that is written almost entirely as meta-code: a small code snippet that does some often crazy stuff to write the actual malicious part on the fly. This is an example, albeit an incredibly rare and difficult-to-create one, of a file “compression” that produces exactly one bit-string as output.

This is the absolute nub of compression theory, and it’s the bit people often either overlook or can’t grasp because it’s wrapped up in more complex terminology. There just aren’t enough possible combinations in a small space to describe every possible combination of a larger space.
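
Concretely, and this is just my own back-of-the-envelope count: there are 2^n bit strings of length exactly n, but only 2^n - 1 bit strings of all shorter lengths put together, so at least one n-bit file always has nowhere shorter to go.

    # How many bit strings exist at each length, versus all shorter lengths combined.
    for n in range(1, 9):
        exactly_n = 2 ** n
        shorter = 2 ** n - 1        # 2^0 + 2^1 + ... + 2^(n-1)
        print(f"length {n}: {exactly_n} strings, only {shorter} shorter ones to map them to")
    # Every row comes up one short, so some n-bit file has no shorter code available.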

Ah-HA!!

Thank you for that! I get it now in a way I didn’t before. Really great answer.

That’s nothing, Indistinguishable. I have a lossless compression algorithm that can take any file at all as input, and is guaranteed to never increase the file size, and which can be iterated such that it’ll eventually get any given input file down to an arbitrarily small size.

Kid, I like your idea of storing the file info in the number of times it’s been compressed. It shows spunk. Dammit, it’s just crazy enough to –

But, wait, what do you do if the input is already as short as possible?

If compression is reversible without increasing file size, it must be permuting the shortest possible files. Which means none are left available to send any other files to. Which means you’ll never be able to get any other files down to that size, no matter how many times you compress.

Alas, even this slight claim is still too crazy to work.

Well, then, obviously, you compress a 1-bit file into a 0-bit file.

EDIT: I knew I’d mentioned that scheme on the SDMB before, but I’d forgotten that the previous time was also in a discussion with you. I might have guessed, though.

But what do you compress the 0-bit file into? You can’t send it to itself if someone else (e.g., that 1-bit file) is being sent to it. But then its file size must increase…

That is, there’s a fundamental constraint on injective (i.e., lossless) maps: if no files go up in size, then no files go down in size either. The only way to move some files lower is by displacing some lower files higher.
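
That claim is small enough to check by brute force. Here’s a quick Python check (my own, purely illustrative) over all bit strings of length at most 2: every injective map that never increases length turns out never to decrease length either.

    from itertools import permutations, product

    # All bit strings of length 0, 1, or 2 (seven strings in total).
    strings = [""] + ["".join(bits) for n in (1, 2) for bits in product("01", repeat=n)]

    # A map that never increases length sends this set into itself, and an
    # injective self-map of a finite set is a permutation, so checking every
    # permutation covers every possible lossless, never-growing compressor.
    shrinks_something = False
    for images in permutations(strings):
        mapping = dict(zip(strings, images))
        if any(len(mapping[s]) > len(s) for s in strings):
            continue                     # this map grows some file: not a counterexample
        if any(len(mapping[s]) < len(s) for s in strings):
            shrinks_something = True     # would shrink a file without growing any other
    print(shrinks_something)             # False: no such map exists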