However, given the right input, and a compression algorithm with enough special cases built in, you could probably compress a file this way, as long as it doesn't contain, say, too many different 6-character strings.
Of course, this wouldn't be very useful, because it would be very difficult to find a file it works with. But in theory you could make a compression algorithm that works very, very well, but only on a very, very small number of inputs.
Well, it’s easy to make a compression algorithm that works very, very well on a very, very small number of inputs. E.g., I can pick four 10GB files of my choosing and compress them down to 2 bits each. And in counterbalance, other files will go up in size. On average, nothing will be saved.
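To make that concrete, here's a toy sketch of such a scheme (with tiny made-up byte strings standing in for the four 10GB files; strictly, each chosen file comes out to 3 bits here, a flag plus a 2-bit index, so that decoding stays unambiguous, while every other file grows by one bit):

```python
# Toy sketch: a "compressor" that wins big on four hand-picked inputs and
# loses one bit on everything else.  The chosen byte strings are made up.
CHOSEN = [b"file one", b"file two", b"file three", b"file four"]

def compress(data: bytes) -> str:
    if data in CHOSEN:                       # chosen file: flag '1' + 2-bit index
        return "1" + format(CHOSEN.index(data), "02b")
    bits = "".join(format(byte, "08b") for byte in data)
    return "0" + bits                        # everything else: one bit LONGER

def decompress(bits: str) -> bytes:
    if bits[0] == "1":
        return CHOSEN[int(bits[1:3], 2)]
    rest = bits[1:]
    return bytes(int(rest[i:i+8], 2) for i in range(0, len(rest), 8))

assert decompress(compress(b"file three")) == b"file three"
assert decompress(compress(b"anything else at all")) == b"anything else at all"
```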
Lossless compression is only possible to the extent that it compresses files on weighted average, weighted by how common those files are. And you can only compress files on weighted average to the extent that their frequencies are not in tune with their coding; that is, to the extent that they were originally stored under some memory-inefficient coding such that some shorter files are less common than some longer files.
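As a minimal sketch of that weighted-average point (the frequencies and code lengths below are made up): with four equally likely messages a variable-length recoding loses on average, but skew the frequencies and it wins, at the price of making the rare messages longer.

```python
# Four messages, a fixed 2-bit original coding, and a prefix-free
# variable-length recoding (0, 10, 110, 111).  Only when frequencies are
# skewed does the recoding save anything on average.
def expected_length(freqs, code_lengths):
    return sum(p * l for p, l in zip(freqs, code_lengths))

uniform  = [0.25, 0.25, 0.25, 0.25]
skewed   = [0.70, 0.15, 0.10, 0.05]
variable = [1, 2, 3, 3]

print(expected_length(uniform, variable))  # 2.25 bits: worse than the original 2
print(expected_length(skewed, variable))   # 1.45 bits: better than the original 2
```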
You’ll likely need a 40GB decompression utility in order to use them again. The combined size of the decompression utility and the 8 bits of compressed data will end up being larger than the original files.
I think you are confusing search trees (either binary trees or balanced “B-Trees”) with data compression. Search trees tend to compress time rather than data, by requiring less effort to locate a data record.
Yeah, on further reading that’s what I remember. The main difference between a Huffman tree and a BST seems to be that a Huffman tree only stores values at its leaf nodes – which is what keeps the code prefix-free, so it can be decoded unambiguously.
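Roughly, something like this sketch (a standard textbook Huffman construction, not any particular library's): keeping symbols only at the leaves means no codeword is a prefix of another, so the decoder can peel off one codeword at a time without any separators.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free code from symbol frequencies.

    Only leaves carry symbols; internal nodes don't.  That is what makes
    the resulting code prefix-free and unambiguous to decode bit by bit.
    """
    tie = count()  # unique tie-breaker so the heap never compares node tuples
    heap = [(f, next(tie), (sym, None, None)) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (None, left, right)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        sym, left, right = node
        if sym is not None:            # leaf: record its codeword
            codes[sym] = prefix or "0"
        else:
            walk(left, prefix + "0")
            walk(right, prefix + "1")
    walk(root, "")
    return codes

print(huffman_code({"a": 0.70, "b": 0.15, "c": 0.10, "d": 0.05}))
# the most common symbol gets a 1-bit codeword; the rarest ones get 3 bits
```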
What a coincidence, I have a decompression utility that will decompress any 10-byte file into a picture of a kitten in a teacup.
Sure. Though if you let me design the machine architecture, I can get the decompression utility to be a built-in machine code of minimal length as well…
Indeed, every machine architecture defines its own notion of ultimate compression, “The smallest (or any) program that outputs this file”. But this measure will be quite architecture-dependent.
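For instance (a deliberately silly illustration, assuming a machine that happens to ship with a Python interpreter): a 10GB file of all zero bytes is "compressed" to the few dozen bytes of program below, but that tiny size only means anything relative to the chosen machine.

```python
# The entire "compressed file": a program that regenerates 10 GiB of zero
# bytes.  Its small size is only meaningful on an architecture that can
# run Python in the first place.
import sys

for _ in range(10 * 1024):                     # 10 * 1024 chunks of 1 MiB = 10 GiB
    sys.stdout.buffer.write(b"\x00" * 1024**2)
```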
There’s actually malware that is written almost entirely as meta-code: a small code snippet that does some often crazy stuff to write the actual malicious part on the fly. This is an example, albeit an incredibly rare and difficult one to create, of a file “compression” scheme whose decompressor produces exactly one bit-string as output.
This is the absolute nub of compression theory - and it’s the bit people often either overlook or can’t grasp because it’s wrapped up in more complex terminology. There just aren’t enough possible combinations in a small space to be able to describe every possible combination of a larger space.
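The counting version of that, sketched below: there are 2^n files of exactly n bits, but only 2^n - 1 files of all shorter lengths combined, so no lossless scheme can shrink them all.

```python
# Pigeonhole in two lines: 2**n files of exactly n bits, but only
# 2**n - 1 files of all shorter lengths (0 through n-1) put together.
n = 16
print(2**n, sum(2**k for k in range(n)))   # 65536 vs 65535
```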
That’s nothing, Indistinguishable. I have a lossless compression algorithm that can take any file at all as input, and is guaranteed to never increase the file size, and which can be iterated such that it’ll eventually get any given input file down to an arbitrarily small size.
If compression is reversible without increasing file size, it must be permuting the shortest possible files. Which means none are left available to send any other files to. Which means you’ll never be able to get any other files down to that size, no matter how many times you compress.
Alas, even this slight claim is still too crazy to work.
Well, then, obviously, you compress a 1-bit file into a 0-bit file.
EDIT: I knew I’d mentioned that scheme on the SDMB before, but I’d forgotten that the previous time was also in a discussion with you. I might have guessed, though.
But what do you compress the 0-bit file into? You can’t send it to itself if someone else (e.g., that 1-bit file) is being sent to it. But then its file size must increase…
That is, there’s a fundamental constraint on injective (i.e., lossless) maps: if no files go up in size, then no files go down in size either. The only way to move some files lower is by displacing some lower files higher.
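A quick way to see the counting behind that (a throwaway sketch, nothing more): for every cutoff k, the strings of length at most k form a finite set, and a lossless, never-lengthening compressor maps that set into itself; an injection from a finite set into itself is onto, so every short slot is already occupied and nothing can strictly shrink without evicting something else upward.

```python
from itertools import product

# For each cutoff k, count the bit strings of length <= k.  A lossless
# compressor that never increases length maps this finite set into itself,
# and an injection from a finite set into itself is a bijection: every
# slot is already taken, so no string can shrink unless another one grows.
def strings_up_to(k):
    return [bits for n in range(k + 1) for bits in product("01", repeat=n)]

for k in range(5):
    print(k, len(strings_up_to(k)))   # 1, 3, 7, 15, 31
```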