On data compression

Even if I grant you that it’s there at all, in ANY number of digits, what we’re all basically saying is:

You cannot have an index that’s shorter than the string it indexes for every possible string. There just aren’t enough indices.

I’m not sure which part we’re getting hung up on. Is it:

a) The proof that it’s mathematically impossible to have a compression scheme that always results in a smaller string than the input for all possible inputs, or

b) That what you’re proposing is mathematically equivalent to (a)?

Here’s a simple example. Pi’s post-decimal string is: 14159265358979323846264…

I want to locate the number “7”, which we’ll agree I can write in one digit, yes?
Using your method, the number would be “12” (counting positions from zero), which takes more digits. The same is true of “8”, whose representation is “10”, and of zero, which doesn’t appear in this fragment at all. This will just get more and more common as we look for longer digit strings.

Continuing in TimeWinder’s vein, just try it for yourself, Projammer. Generate random strings (say, of length 9; your Pi searcher goes up high enough to handle the experiment definitively on these), feed them to that Pi searcher, and see how many of your randomly generated strings end up coming out with shorter indices.

You’d be extraordinarily lucky to get even one hit. Just see for yourself.
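If anyone wants to actually run that experiment, here is a rough sketch of it in Python. The mpmath package, the ~100,000-digit search depth, and the trial count are all my choices, not anything from this thread; the online pi searcher goes much deeper than this local search.

```python
# Rough version of the experiment: random 9-digit strings, searched against
# the first ~100,000 post-decimal digits of pi.
import random
from mpmath import mp

mp.dps = 100_010                              # working precision, in digits
pi_digits = mp.nstr(mp.pi, 100_000)[2:]       # strip the leading "3."

trials, shorter = 200, 0
for _ in range(trials):
    s = "".join(random.choice("0123456789") for _ in range(9))
    pos = pi_digits.find(s)                   # -1 if not found this shallow
    if pos != -1 and len(str(pos)) < len(s):  # did the "index" come out shorter?
        shorter += 1

print(f"{shorter} of {trials} random 9-digit strings got a shorter index")
```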

Maybe it was a snark, and also a knee-jerk reaction to a statement of the obvious: that I had picked my sample from the existing data set. It was obvious. Apologies if needed.

Cite?

I’ll also apologize for using compression in the title as it seems to have had the effect of totally derailing the thread.

The data within the file is not compressed; it is just externally referenced as a subset of an infinite string of data.

There is no need to assign each position an index; they are indexed by virtue of their position. Nothing need be stored to know that the index of the 100th character is 100; you just start at the first character and count up. Thus the need for incredible amounts of processing power to be able to recover the billion-billionth character in a reasonable time.

There is no high-order math in use here, just brute-force calculation of pi, fractals, sqrt(2), or whatever the reference set is, to however many positions are needed to discover the data sequence you’re trying to store.

As for the size of the index exceeding the size of the data: I seem to have erred in referring to it as an index. It is an offset into the fixed, infinite data set. There is nothing in the offset itself to indicate what it is referencing. It is just a number.
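In code, the idea is roughly this (a minimal sketch: mpmath, the 10,000-digit depth, and carrying the length alongside the offset are illustration choices, not part of the claim, since the thread never spells out how the end of the data is signalled):

```python
# Minimal sketch of the offset-into-a-reference-set idea.
from mpmath import mp

mp.dps = 10_010
REFERENCE = mp.nstr(mp.pi, 10_000)[2:]     # post-decimal digits of pi, the fixed "reference set"

def encode(data: str):
    """Brute-force search: return (offset, length), or None if it lies deeper than we computed."""
    pos = REFERENCE.find(data)
    return None if pos == -1 else (pos, len(data))

def decode(offset: int, length: int) -> str:
    """Recompute the reference set and read `length` characters at `offset`."""
    return REFERENCE[offset:offset + length]

ref = encode("592653")                     # an early substring of pi, so the search succeeds
print(ref, decode(*ref))                   # -> (3, 6) 592653
```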

As I stated earlier, 8 digits (hex) of offset covers all of the first 18 and a half million billion characters of the reference set. Adding just one more digit adds 16^17 (295,147,905,179,352,825,856 = 295 million billion) more positions deeper into the reference set. And every digit added expands the accessible depth exponentially.

The chances of matching 100 characters from a 9-digit offset in that sample? Pretty good, I’d think. We just don’t have the time to prove it one way or the other.

The mixing of “hexadecimal,” “digit,” and “character” is getting confusing. Can we agree to limit this to decimal interpretations everywhere? Otherwise there’s this extra 16/10 factor floating around that makes the numbers ugly.

Besides, 8 hex digits only give you 4.2 billion numbers anyway, not the 18.5 million billion you’re claiming. That sort of several-millionfold error here and there can make a difference.

Let’s switch to decimal, please, since our source string is. Expressing our offset in ten decimal digits gives us 10 billion offsets (which, BTW, is the same as an index), close enough to the 4.2 billion your hex digits will give us.
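For the record, the raw counts behind those two figures (plain integer arithmetic, nothing pi-specific):

```python
# The offset counts under discussion, spelled out.
print(f"{16**8:,}")    # 8 hex digits:      4,294,967,296 offsets (about 4.3 billion)
print(f"{10**10:,}")   # 10 decimal digits: 10,000,000,000 offsets (10 billion)
```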

Now, you claim that we don’t have time to figure out what the chances are of finding a hundred-digit number with a nine (I’ll use ten) digit offset.

Well, as it so happens, I’ve got the time. There are 10^100 different hundred-digit numbers. That’s a million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million different numbers.

You have 10^10 offsets available to find it with, so that’s the maximum number of these that you can represent. So the odds of a given 100 digit number appearing in the 1 billion digits available is 1 in 10^90, (one in a million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million, million).

See, that didn’t take much time at all, once my pasting-finger healed.
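And for anyone who would rather let a computer do the pasting, here is the same arithmetic as a quick check:

```python
# The same arithmetic, minus the pasting.
hundred_digit_strings = 10**100   # distinct 100-digit numbers
ten_digit_offsets     = 10**10    # distinct offsets a 10-digit number can name
print(hundred_digit_strings // ten_digit_offsets == 10**90)   # True: odds of 1 in 10**90
```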

And all of this ignores the point that someone made earlier: the number with offset 0 and the number with offset 50, for example, have half of their sequence exactly in common, since they overlap. So the actual number of unique sequences is much smaller.

Please trust me when I tell you that the numbers get WORSE as the strings get longer, not better. (For example, the odds of a given digit sitting at any one particular offset are about one in ten; for a given two-digit string, they’re about one in 100, and so on.)

I would agree with him. In point of fact, I would say that on average, the size of the index would be exactly the same as the length of the file. The odds of getting that exact combination of data at any given offset are 1 in 2 to the power of the file’s length in bits, so on average you have to look about that many positions into a random bit field before you find it, and an offset that deep takes about as many bits to write down as the file itself.
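Here is a rough empirical check of that, with a random bit field and 16-bit “files”; the field size, pattern length, and trial count are arbitrary choices of mine:

```python
# Average offset of a random n-bit pattern in a random bit field.
import random

random.seed(0)
FIELD_BITS = 2_000_000
field = format(random.getrandbits(FIELD_BITS), f"0{FIELD_BITS}b")

n, offsets = 16, []
for _ in range(200):
    pattern = format(random.getrandbits(n), f"0{n}b")
    pos = field.find(pattern)
    if pos != -1:                      # essentially always found at this field size
        offsets.append(pos)

avg = sum(offsets) / len(offsets)
print(f"average offset ~{avg:.0f}, i.e. about {int(avg).bit_length()} bits, "
      f"for {n}-bit patterns")
```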

You’re being misled by things like the fact that an MD5 hash has uniquely identified every file we’ve ever tested. That’s because a) we just can’t create and test enough files to cover the field space, and b) most files aren’t random data, so their compressibility makes them occupy a narrower slice of the total field space than random data would.
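The counting behind that, for the record (the 1 KB file size is just an example of mine):

```python
# Far more possible files than possible MD5 digests, so collisions must exist.
md5_digests  = 2**128          # every possible 128-bit digest
one_kb_files = 2**(8 * 1024)   # every possible 1-kilobyte file
print(len(str(md5_digests)))   # 39 decimal digits
print(len(str(one_kb_files)))  # 2467 decimal digits
```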

When I use the word “compression”, I use it simply to mean the proposal that there are two functions from strings to strings, Compress and Decompress, such that Decompress(Compress(x)) = x and Compress(x) is generally smaller than x. You are, without a doubt, describing compression in this sense, and my arguments have all been regarding compression in this sense.

Nothing in my arguments had to do with “storing” anything anywhere. What I’m pointing out is that there aren’t enough positions to do what you want; your method does not provide effective shortening.

Please, respond to either one of the following two arguments:

  1. Can you, without “cheating”, come up with so much as a single 9-digit string whose position in pi is shorter than the string itself? Just give this a try. If you can’t, it would appear your scheme is doomed. [Feel free to replace “9” with other large numbers of your choice, if you feel that would make it easier]

  2. Suppose Decompress and Compress are mathematical functions with the properties above. Take the hundred shortest strings and stick them in pile A. Compress the hundred shortest strings, and go toss the results into pile B. Since Compress has to produce distinct values on distinct strings, at best the things in pile B are the hundred shortest strings. But that means, at best, the average length in pile B is equal to the average length in pile A. Thus, no lossless compression scheme can do better, on average, over the hundred shortest strings, than simply leaving them uncompressed. Feel free to replace “hundred” throughout with the number of your choice.
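If it helps, argument 2 can be run numerically, with your pi-offset map standing in for Compress. The 110 strings of length one or two (a slight variant of “the hundred shortest”), the 10,000-digit reference set, and mpmath are all choices of mine:

```python
# Pile A: the decimal strings of length 1 or 2. Pile B: their offsets in pi,
# written out as decimal strings. Compare the average lengths of the two piles.
from mpmath import mp

mp.dps = 10_010
pi_digits = mp.nstr(mp.pi, 10_000)[2:]

pile_a = [str(d) for d in range(10)] + [f"{d:02d}" for d in range(100)]
pile_b = [str(pi_digits.find(s)) for s in pile_a]   # find() is -1 only if absent from 10,000 digits

def avg_len(pile): return sum(len(s) for s in pile) / len(pile)
print(f"pile A average length: {avg_len(pile_a):.2f}")
print(f"pile B average length: {avg_len(pile_b):.2f}")
```

Whatever reference set you pick, pile B cannot come out shorter on average than pile A, for exactly the counting reason given in argument 2.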

We have sufficient time to prove it one way. TimeWinder has offered the proof dealing with this directly. I have offered you a more general proof that your method doesn’t do what you apparently think it does. Can you point out any flaw in these proofs? Or would you like to more narrowly circumscribe the claims you are making, to make it clear that they do not run afoul of this proof?

Serves me right for the snarky comment about several-millionfold errors. I got carried away on the pasting – the two numbers above should have about a third as many "million"s in them. But there are enough left to make the point, I think.

Think of it this way:

If I have a 1-bit long file, to match every single possible combination of bits that the file could be, I need a bit field at least 2 bits long: One bit for 0 and one for 1. So what is the maximum offset? Well to access two slots, I need the offsets 0 and 1. I can represent that with one bit, so my index needs to be 1 bit long. This is the exact same length as my file.

If I have a 2-bit long file, to match every single possible combination of bits that the file could be (00 01 10 11), I need a bit field like this: 00110 What is the maximum offset? 3 How many bits do I need to represent an index value from 0 to 3? 2-bits! Exactly the same!

If I have a 3-bit long file, to match all the combinations (000 001 010 011 100 101 110 111), I need a bit field like: 0111000101 What is the maximum offset? 7 How many bits do I need to represent an index value from 0 to 7? 3-bits! Exactly the same!

Regardless of the length of your file, the index you need to find it (even given a bit field that is specifically constructed to pack in all the bit combinations needed for a file of that length) is going to be the same length as the file itself. With purely random data as your bit field, on average you will probably need more than the length of the file, since the data won’t be minimizing duplicate patterns in your search length.
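You can check those little bit fields mechanically, by the way. The 2- and 3-bit fields below are the ones above; the 1-bit field and the checking code are mine:

```python
# Each field must contain every n-bit pattern, and even so the largest offset
# still takes n bits to write down.
from itertools import product

fields = {1: "01", 2: "00110", 3: "0111000101"}

for n, field in fields.items():
    patterns = ["".join(bits) for bits in product("01", repeat=n)]
    offsets = [field.find(p) for p in patterns]
    assert min(offsets) >= 0                          # every pattern really is in there
    print(n, field, max(offsets), max(offsets).bit_length())
```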

“On average” be damned. Even with a perfect bit field you need an index the length of the file. At minimum you need an index this big, I should have said. And with random data, you could make the index double the size of the file and it would be perfectly possible that the sequence you are looking for is one offset further into the bit field than your index can access. So there really is no safe index size to use with random data, just a diminishing likelihood of buffer overflow.

Let’s illustrate this visually. There’s a street full of houses, all on the same side of the road; House #1 at the very west end, then, moving east, House #2, House #3, ad infinitum (or ashighasyouwantum). There are also a hundred people in front of them, the last survivors of the apocalypse. They wear shirts with numbers on them. In front of House #1 is Person #1, in front of House #2 is Person #2, and so on through House #100 and Person #100. Only problem is, Person #N doesn’t generally live in House #N. Rather, the number of Person #N’s house is determined by, say, some method involving the position of N in the digits of π. Very well; when I blow the whistle, everyone will go home. Ready? toot

Oh, look at that. It’s pandemonium. Ok, let’s try this again from the start, but we’ll do it in a more orderly fashion. Ok, if anyone lives in House #1, please go home by swapping positions with the guy currently standing there. Thank you. If anyone lives in House #2, please go home by swapping positions with the guy currently standing there. Thank you. If anyone lives in House #3, … Thank you. If anyone lives in House #100, please go home by swapping positions with the guy currently standing there. Thank you.

Hm. Ok, maybe there are some leftovers now, people who don’t live in Houses #1 - #100. Fine. If anyone hasn’t gotten home yet, just go. Hike east up the street to the larger-numbered houses till you reach home. Thank you.

Good, everybody’s settled in. What just happened to get everybody home? Well, at the beginning, people just swapped positions; any time one person moved west, another person moved the same amount east. The center of mass of humanity as a whole didn’t change at all during this part.

After that, maybe, there were some people left over, all of whom had to hike east to get home. The center of mass of humanity as a whole, if anything, shifted east during this part.

Conclusion: If anything, the center of mass of humanity shifted east during this exercise.

Translation: If anything, the average number is increased by moving to its position in π.

Rather, to rephrase that last line:

Translation: If anything, on average, moving from a number to the one corresponding to its position in π is an increase.

[I have glossed over the details of how one is to distinguish between “1”, “14”, “1415”, etc., all of which have the same position in π. You could clearly accommodate for such things one way or another, but it doesn’t really matter how you work it out, since nothing in the argument really depends on what scheme you’re using anyway]
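Since the street scene is really just a little algorithm, you can simulate it. The random home assignment below is mine; the two phases are the ones described above, with a check that the first leaves the center of mass alone and the second can only move it east:

```python
# 100 people, an arbitrary injective assignment of home numbers. Phase 1 swaps
# never move the center of mass; phase 2 hikes can only move it east.
import random

random.seed(1)
N = 100
home = random.sample(range(1, 100_000), N)   # person i lives at house home[i]
pos = list(range(1, N + 1))                  # person i starts in front of house i+1

def mean(xs): return sum(xs) / len(xs)
start = mean(pos)

# Phase 1: for each of houses 1..N, whoever lives there swaps with whoever is standing there.
for h in range(1, N + 1):
    if h in home:
        p = home.index(h)                    # the person whose home is house h
        q = pos.index(h)                     # the person currently standing at house h
        pos[p], pos[q] = pos[q], pos[p]
assert mean(pos) == start                    # swapping leaves the center of mass alone

# Phase 2: anyone not yet home hikes east to their (higher-numbered) house;
# for everyone else this is a no-op.
for i in range(N):
    pos[i] = home[i]
print(start, mean(pos))                      # the average can only have moved east
```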

This is a question of compression, even though it doesn’t seem like it. Indexing within a procedurally-generated sequence is a recognised compression methodology.

Here’s another way to see why it can’t work for all possible inputs (same reason any compression algorithm can’t compress every possible input):

Suppose you have the ten-digit input string 1234567890, and your compression/indexing methodology finds a way to reduce this to a nine-digit index value: say, 987654321

If the methodology is truly universal, you should be able to feed that nine-digit index string back into the machine and reduce it to less than nine digits.

Indeed, for an algorithm capable of compressing any data, it should be possible to keep repeating the process until any input was packed down into a single digit.

Except that it’s obvious that could never work. You cannot express all possible combinations of two digits with a one-digit index; you cannot express all possible combinations of three digits with a two-digit index, and so on - there just isn’t enough room in the index reference string to describe all possible things it’s supposed to be indexing, if those things are longer than the index itself.
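The counting is the same at every length; here is the tally for the first few (decimal strings, leading zeros allowed, which is my framing):

```python
# Strings of length exactly k, versus all strictly shorter strings that could
# serve as indices for them.
for k in range(1, 8):
    length_k = 10**k
    shorter  = sum(10**j for j in range(1, k))   # strings of length 1 .. k-1
    print(f"k={k}: {length_k:>9,} strings of length {k}, "
          f"only {shorter:,} shorter candidates")
```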

To be fair, Projammer understands that not all inputs will get shorter; e.g., clearly, “7” gets longer (its position being 13, if the first position after the decimal point is position 1). But Projammer still thinks it’s worth it overall or for large inputs or most of the time or some such thing. That’s why I’ve tried to emphasize that even the average size change is in the direction of larger, not smaller.

Hey, why don’t we reference arbitrary data by citing its location in the Library of Babel? :slight_smile: