On data compression

Rather than hijack this thread about a 96k FPS game, I’ll drag this thought over here.

ultrafilter pointed out in response that fractals are utilized in some data compression algorithms, and while this is true, I was referring to the entire sequence of digits that makes up the file existing from a single starting point.

He also suggested that the size of the index could be larger than the original file being referenced. This is quite possible for small files, but I would think it unlikely for files that are huge (>1 GB) by today's standards.

Why do you think it’s unlikely? I don’t see any basis for that belief.

How large of a number could you store in a gig of data space?

Assuming base 16…

1 byte = ff = 255
2 bytes = ff ff = 65,535
4 bytes = ff ff ff ff = 4,294,967,295
8 bytes = … = 18,446,744,073,709,551,615

So 8 bytes of index gets you a bit over 18 billion billion (roughly 18 quintillion) bytes down the data stream. My calculator won't let me try to work up a 16-byte index, but you get the picture.
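For what it's worth, a few lines of Python extend the table past where my calculator gives up (just arithmetic, nothing clever; the byte counts are the ones from above plus 16):

```python
# Largest unsigned value representable in n bytes is 2^(8n) - 1.
for n_bytes in (1, 2, 4, 8, 16):
    print(f"{n_bytes:2d} bytes -> {2**(8 * n_bytes) - 1:,}")
```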

The answer is, “about a large enough number to index into a normal sequence far enough to have a good chance of finding a match for a 1GB file.” :stuck_out_tongue:

As the size of a number increases, the probability of finding that number in a given section of some normal sequence diminishes rapidly. The fact that you can address 4GB of memory with only 32 bits of information is irrelevant.

Say you want to find a match for a 4-bit number within some normal sequence. If you pick any 4 consecutive bits from that sequence, the probability that every one of them matches your number should be about 1 in 2^4, or 1/16. So, your chances of finding your number beginning within 16 places of the start of your sequence are fairly good. To actually save any space, you would need to find it within 8 places, so you could use 3 bits for your index. If you’re unlucky and it takes more than 16, you need to use 5 bits to store the index.
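If you don't want to take the arithmetic on faith, here's a quick simulation sketch in Python, using the random module as a stand-in for a normal bit sequence (my choice of stand-in; there's nothing special about it):

```python
import random

def first_match(pattern_len, stream_len=2_000):
    """Index where a fresh random pattern of pattern_len bits first appears
    in a fresh random bit stream (or None if it doesn't show up at all)."""
    pattern = "".join(str(random.getrandbits(1)) for _ in range(pattern_len))
    stream = "".join(str(random.getrandbits(1)) for _ in range(stream_len))
    pos = stream.find(pattern)
    return pos if pos != -1 else None

# The average first-match position for 4-bit patterns comes out around 2^4 = 16,
# so just writing down the index already takes about as many bits as the pattern.
hits = [first_match(4) for _ in range(2000)]
hits = [h for h in hits if h is not None]
print(sum(hits) / len(hits))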

A one gigabyte file is a sequence of 8,589,934,592 bits. Thus, the probability of an arbitrary subsequence being a match would be 1 in 2 to the power of 8,589,934,592, a phenomenally huge number. Even if a match exists within, say, the first sixteenth of that space, after you've taken a few quintillion years to generate and check that many bits of your sequence, you wind up saving 4 bits of storage. (Since an index of 8,589,934,592 − 4 bits can address one sixteenth of a space of 2^8,589,934,592 positions.)
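To put rough numbers on just how lopsided that is (a back-of-the-envelope script, nothing more):

```python
import math

bits = 8 * 1024**3                      # bits in a one-gigabyte file: 8,589,934,592
digits = int(bits * math.log10(2)) + 1  # decimal digits in 2**bits
print(f"2^{bits:,} is a number roughly {digits:,} digits long")

# Finding the file within the first sixteenth of that address space trims the
# index by log2(16) = 4 bits -- out of 8,589,934,592.
print(int(math.log2(16)), "bits saved")
```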

Not a compression algorithm I’d place any bets on, that’s for sure.

Now, if you get astonishingly lucky and find your number within a few million places, what are the odds that the data you’re storing happens to be “digits 42,053,221 through 946,533,410 of pi”? :wink:

For the record: 121121112111121111121111112… is an infinite and non-repeating sequence which does not contain every possible combination of numbers. So the premise is shot from the get-go.

But even so, your own calculations should show why it doesn't work. Let's say we want to compress 8-bit files. There are 256 of them, and each one needs its own index. What's the smallest number of bits that can represent 256 different values? Well, it's 8.

This little bit of (valid!) circular reasoning is why there exists no universal compression algorithm: i.e. no algorithm that can (losslessly) compress ANY data to smaller than its original size.
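The counting underneath that is small enough to check by brute force; a toy sketch:

```python
# There are 256 distinct 8-bit files, but only 2^0 + 2^1 + ... + 2^7 = 255
# bitstrings shorter than 8 bits: one short, so at least one 8-bit file
# cannot possibly be given a strictly shorter code.
files = 2**8
shorter_codes = sum(2**k for k in range(8))
print(files, shorter_codes, files > shorter_codes)
```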

Sure, you can store a jillion different starting points in 8 bytes, but there are the same number of possible sequences you could be looking for (2^64), even if you're only looking for 8-byte sequences – much less gigabyte-range ones, in which you're trying to index 2^(8 × 1024 × 1024 × 1024) different unique patterns with only 2^64 valid places to put them. It actually gets worse the longer your files get – not better.

Psst: it’s spelled “pedantic.”

:wink:

Yeah, but if I had infinitely many monkeys computing infinitely many normal numbers… :stuck_out_tongue:

At any rate, it’s good to know that I can’t be sued for copyright infringement just because I have some program on my PC that could calculate a normal number to arbitrary accuracy.

But what about looking for approximate fits? An image wouldn’t look all that different if a few pixels were the wrong colour, and I’m sure one could invent a rule set that quantized the allowed amount of deviation. It wouldn’t be lossless compression any more, but the probability of finding what you’re looking for would be greatly improved if you’re not looking for one specific string, but either of a couple of thousand similar ones (well OK, with the numbers involved here, a couple of thousand probably won’t cut it…).

You’re on my list. And it’s not the good list. :smack:

Still no magic compression; no free lunch of bits. If you want to store enough information to recover X bits of color detail per pixel on arbitrary bitmaps, you’ll have to store X bits per pixel on average.

Besides, you’d hardly need to use π (or a provably normal number) for the achievable aspect of the effect you are describing. You could just downsample your bitmap from 32-bit color to 16-bit color, say, and, presto, half the size at the cost of some pixels changing in color a little.

JPEG compression works very nicely because it’s actually very finely tuned to what matters perceptually in an image file and what kinds of image files are more likely to be stored than others [all lossless compression is based on the fact that some files of a given length are more likely to be stored than others, and correcting this mismatch of codings; lossy compression further augments this by noting that some bit erasures make larger perceptual differences than others, essentially correcting for this misalignment of codings, and then discarding information. In both cases, it’s key that the compression is domain-specific; you couldn’t hope to compress arbitrary strings without knowing what probability distribution they come from (and how it differs from the one which generates strings by coin flips)]. Something like “Well, just take a bitmap, and toss out random bits” won’t be anywhere near as good.
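To see the "it's all about the distribution" point concretely, here's a quick sketch using zlib as a stand-in for any off-the-shelf lossless compressor (my choice of example, not anything specific to JPEG):

```python
import os
import zlib

uniform = os.urandom(100_000)                        # every byte value equally likely
skewed = bytes(b % 16 for b in os.urandom(100_000))  # only 16 of the 256 byte values occur

# The uniform data is already incompressible; the skewed data carries roughly
# 4 bits of information per 8-bit byte, and zlib recovers about that factor of two.
print(len(zlib.compress(uniform)), len(zlib.compress(skewed)))
```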

Going back to the larger point, there’s nothing particular special about normal sequences that would make them very good for the purposes of compression; indeed, the statistical properties which define normal sequences almost guarantee that compression methods based on indexing into normal sequences will be of no use at all. Certainly, normal sequences provide no way around the inescapable mathematical fact that you need at least N bits, on average, to store 2^N many strings distinctly.

(That last line is essentially in reference to prefix-free, or self-delimiting, codings [i.e., ones which aren’t allowed to squeeze extra information out of the presence of an End-Of-File]. Obviously, with non-self-delimiting codings, you can store 2^N strings in less than N bits on average (e.g., by storing 2^N-1 of them as the 2^N-1 strings of length less than N, even though some of these are prefixes of others). Not that it makes any real difference, but the analysis is that much simpler.)
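For the curious, that non-self-delimiting loophole worked out for the 8-bit case (a toy calculation, following the parenthetical above):

```python
# Non-self-delimiting codes CAN average below N bits: assign 255 of the 256
# possible 8-bit files to the 255 bitstrings of length 0..7, and the last
# file to any 8-bit string.  Average code length:
N = 8
total_bits = sum(k * 2**k for k in range(N)) + N   # lengths 0..7, plus one length-8 code
print(total_bits / 2**N)                           # about 6.04 bits per file
# The catch: "0" is a prefix of "01", so without an end-of-string marker you
# can't tell where one code stops -- exactly the self-delimiting caveat.
```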

Reading through this so far, it appears that ultrafilter and Stealth Potato are the only ones who have truly grokked my train of thought.

There is no actual data compression occurring, just the locating of a finite string of digits within an infinitely long set of digits that can be reproduced at will using a known formula.

Let me link outside here to give an example.

Search Pi is a website that allows you to search for a series of numbers within the first 200 million digits of pi.

Assume for example that I have a file (its actual use is irrelevant: graphic, audio, executable, DLL, whatever). The actual file as stored on the disk is 14320400638132246584111115775835858113501856904781.

Copy and paste that into Search Pi and the result will be 32768, a number that can be stored in two characters, a much smaller footprint than the original 50 characters. Forgive my jumping between decimal and hexadecimal; it's moot to the point I'm making.

The only real obstacle I see to this is the sheer amount of computing power it would take to first locate the file as a subset of the infinite data stream, and then again to calculate the sequence back out again. The speed needed would render any encryption algorithm in use or even theorized today to be the equivalent of the daily cryptoquote.
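If anyone wants to play with the locate-and-regenerate idea locally instead of through the website, here's a rough sketch using the mpmath library (an assumption on my part that it's installed; the digit count is kept small precisely because generating digits is the expensive step):

```python
from mpmath import mp   # assumption: the mpmath library is available

N_DIGITS = 100_000                        # modest chunk of pi; raise it if you're patient
mp.dps = N_DIGITS + 10                    # working precision, with a little slack
pi_digits = mp.nstr(mp.pi, N_DIGITS)[2:]  # drop the leading "3." and keep the fraction

def locate(file_digits):
    """0-based offset of file_digits within our chunk of pi, or None if it isn't there."""
    pos = pi_digits.find(file_digits)
    return pos if pos != -1 else None

def regenerate(offset, length):
    """Recover the original digits from the (offset, length) 'index'."""
    return pi_digits[offset:offset + length]

target = "0123456789"                     # a 10-digit "file"
offset = locate(target)
if offset is None:
    # A 10-digit target typically needs on the order of 10^10 digits of pi
    # before it turns up at all -- which is exactly the problem.
    print(f"{target} not found in the first {N_DIGITS:,} digits")
else:
    assert regenerate(offset, len(target)) == target
    print(offset)   # the offset itself usually needs about as many digits as the target
```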

Are you proposing this as a method of (perhaps only approximately) reversibly turning large files into small files or not?

Well, ok, but it’s still totally unfair to call “14320400638132246584111115775835858113501856904781” 50 characters if you’re going to call “32768” 2 characters…

At any rate, the point remains: on average, the indices of strings of a given length will not be any shorter than that length. There's no way around it; it's just mathematical fact. It doesn't matter whether you use π or any other infinite string to index into (except insofar as some infinite strings, including possibly π, may not even contain every finite string).

That is the net result, but not the actual process. And application data cannot be approximately stored; it must be the same coming out as it was going in, or it is useless.

You just gave me a thought for another example to attempt to explain this.

Consider the infinite data set as a universal disk drive; the index is just the pointer to the start of the data needed on that drive. The file isn't compressed into the pointer in the file allocation table; the pointer just tells you where to find the file.

[Again, nitpickery, this is assuming we use a representation of indices in which the end is explicitly indicated, so that no index is a prefix of another one. Makes no real difference, but simplifies the analysis.]

Nice example. Too nice, in fact. What we're saying above is that you ARE specifying a compression scheme: you're storing a 50-digit string in 2 characters. And you clearly picked that number to match it, since your results were too perfect.

I’m going to look for a very simple digit string in pi, using your cite site:
“01234567890123456789012345678901234567890123456789”

Here’s the result:
The string 01234567890123456789012345678901234567890123456789 did not occur in the first 200000000 digits of pi after position 0.

Fair enough, let me try another one:
1111111111111111111111111111111111111111111111111

“The string 1111111111111111111111111111111111111111111111111 did not occur in the first 200000000 digits of pi after position 0.”

I tried a half-dozen other 50-character strings selected for simple patterns and at random – none were found. What I didn’t do is what I suspect you did: look up what was at a given position, and use that string as an example.

That’s like saying I can compress any word to “third word of Moby Dick”, so long as the word is “Ishmael.” It’s true, but it doesn’t generalize.

What we’re saying is: there’s no free lunch. For every 50-digit string that can be located with a two-character offset, there are hundreds of trillions that cannot: some offsets will be 2 characters, some will be 3, and most will be 50 characters or more (i.e. no better than the original, and often worse). We’re saying that because of the information issue above, the BEST you can hope for is that the average offset for a 50-digit string is 50 digits long. In fact, since pi’s digit patterns aren’t guaranteed to be randomly distributed, you may well do worse than that on average.
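Here's a quick way to check that "average offset" claim empirically, using a random digit stream as a rough stand-in for pi (an assumption on my part, since pi's digits are only conjectured to behave this way):

```python
import random

DIGITS = "0123456789"

def first_offset(k, stream_len):
    """0-based offset of a fresh random k-digit pattern in a fresh random digit stream."""
    pattern = "".join(random.choices(DIGITS, k=k))
    stream = "".join(random.choices(DIGITS, k=stream_len))
    return stream.find(pattern)

# For k-digit targets the first hit typically lands around offset 10^k, so the
# offset itself needs about k digits -- no shorter, on average, than the target.
for k in (2, 3, 4):
    offsets = [first_offset(k, 20 * 10**k) for _ in range(100)]
    offsets = [o for o in offsets if o != -1]
    print(k, sum(offsets) / len(offsets))
```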

Oh jeez. Busted. Take me away detective. :rolleyes:

I expected it to be obvious that I copied out the first 50 characters starting at 2^15. It was an example. 200 million is a decidedly small data set to pull 50 characters out of. Of course I grabbed something that was there.

So 1111111111111111111111111111111111111111111111111 didn’t occur in the first 200 million, how about the first 200 billion? Trillion? Trillion^Billion?

Let me try an example: Let’s not use Pi as our source. Let’s use a number string we generate by something like a triangulation rule:

01234567890001020304…09101112…9899000001002…

(basically every one-digit number, then every two-digit number, then every three-digit number, etc.)

This is the same as your pi example, but it has two properties that pi doesn’t:

  1. every number is guaranteed to be in it, which isn’t true of pi.
  2. It’s obvious (isn’t it?) that almost no number’s index is going to be shorter than itself, and after the first ten or so, they’ll all be longer on average - and the index grows much faster than the numbers.

But there are exceptions to property 2: numbers like “23” which occur earlier in the string because they’re made up of consecutive values of shorter strings, and hence appear earlier than you’d expect (the string itself, for example, has index 0). Your example above was basically “pick 23.” Which is well and good, but doesn’t tell us anything about 00000, for example.
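Here's that sequence in code, for anyone who wants to poke at property 2 directly (a small sketch; I only generate up to four-digit numbers to keep it manageable):

```python
def enumeration_sequence(max_width):
    """'0123456789' + '000102...9899' + '000001...999' + ...: every 1-digit number,
    then every 2-digit number (with leading zeros), then every 3-digit one, etc."""
    return "".join(str(n).zfill(w) for w in range(1, max_width + 1)
                                   for n in range(10**w))

seq = enumeration_sequence(4)          # long enough to contain every 4-digit string
shorter, not_shorter = 0, 0
for n in range(10**4):
    target = str(n).zfill(4)
    index = seq.find(target)           # every 4-digit target is guaranteed to appear
    if len(str(index)) < len(target):
        shorter += 1
    else:
        not_shorter += 1
print(shorter, not_shorter)            # the "lucky" early hits are a small minority
```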

What exactly is it that you’re hoping to do, Projammer? It looks like you’re hoping to do something impossible, which is to design a method of assigning strings a representation which is shorter on average.

Let’s demonstrate the problem with universal compression as simply as possible:

Consider the numbers 0 through a hobajillion. You want to assign to each of these a distinct index. You want those indices to be as low as possible. What’s the best you can do?

Well, you’ll have to use a hobajillion different values. The lowest you can go, then, is by assigning the indices 0 through a hobajillion, in some permutation. But that means, on average, you’ve saved diddly-squat; the sum of the lengths of the inputs is equal to the sum of the lengths of their indices. Some numbers might go to much lower indices, and some might go to much higher indices, but on average, the very best you can do is not compressing at all.
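Since a hobajillion won't fit in my RAM, here's the same argument scaled down to a million (a toy check, nothing deep):

```python
import random

N = 1_000_000
original = list(range(N))
indices = original[:]            # the best you can possibly do: some permutation of 0..N-1
random.shuffle(indices)

total_in = sum(len(str(n)) for n in original)
total_out = sum(len(str(i)) for i in indices)
print(total_in, total_out, total_in == total_out)   # same multiset of lengths -> zero savings
```

Any assignment that isn't just a permutation of 0 through N−1 has to reach for larger indices somewhere, so the total can only go up from there.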

It doesn’t matter what you try, whether it involves π or some other fancy scheme. There’s no getting around this fundamental theorem of pigeonholery. This is a problem that can’t be solved. The only thing you can do, is to forget it and try to solve another problem. And that’s where domain-specific compression comes in (compression targeted to the mismatch between some data’s actual distribution and the way it is normally coded). But universal compression? It’s like a perpetual motion machine, only even worse. We could, with some small chance, be very wrong about the laws of physics, but this? No, this is robust, in the ultra-strong 2 + 2 = 4, numbers don’t lie, mathematical proof sense.

Do you want to snark or do you actually want to learn the mathematical facts about your proposal (a very frequently proposed, and debunked, idea)? TimeWinder was just trying to helpfully point out the misleading aspect of your example, in order to keep you yourself from being misled into interpreting it as providing evidence for something it actually does not.