Eh, most real-world compression programs won’t, in fact, work for “decompressing” random noise, because the compression process adds in a little extra redundancy for purposes of error detection and correction (which you can do very efficiently, if you know what you’re doing). You could leave that step off, of course, but to do that with off-the-shelf programs, you’d probably need access to the source code, which you may or may not have.
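You can see the effect with stock tools; here’s a minimal sketch, assuming Python’s standard gzip and zlib modules (the 64 KB of noise is arbitrary):

```python
import gzip, os, zlib

noise = os.urandom(1 << 16)  # 64 KB of random bytes

# A real compressed stream carries headers and checksums; random noise
# almost never passes those checks, so "decompression" simply fails.
for name, decompress in (("gzip", gzip.decompress), ("zlib", zlib.decompress)):
    try:
        decompress(noise)
        print(name, "accepted the noise (astronomically unlikely)")
    except Exception as e:
        print(name, "rejected the noise:", e)
```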
I saw the thread title and thought you meant the film π and the cryptic messages the protagonist received. Like that dream of a brain in the subway, or finding the Go pieces arrayed on the board in a Fibonacci spiral.
Beginning at digit 39,674,512,221,681,735,723, it says, ISYN: “Jeffrey Epstein didn’t kill himself”.
Given that the digits of Pi are infinite, non-repeating and normal, doesn’t that mean that “any” given word, sentence, or entire book is represented in there somewhere? In fact, all words ever written, in any and every order, must be in there somewhere. Including this post and also what is written next! Such is the nature of infinity.
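You can poke at this empirically for short digit strings, at least; a toy sketch, assuming the mpmath library (the target string and digit count are arbitrary):

```python
from mpmath import mp

mp.dps = 100_000                 # compute pi to roughly 100,000 decimal places
digits = str(mp.pi)[2:]          # drop the leading "3."

target = "31415"                 # any digit string you like
pos = digits.find(target)
if pos >= 0:
    print(f"'{target}' first appears at decimal digit {pos + 1}")
else:
    print(f"'{target}' not found in the first {len(digits)} decimal digits")
```

With only a hundred thousand digits in hand you can realistically only hope to find short strings; anything book-length would sit unimaginably far out even if pi is normal.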
The best algorithms for English text compression are about 88%. Let’s see, I was guessing 1.3 bits per byte, but maybe 7 bits would be better. That’s 80%. So I’m happy with a ratio somewhere in that range for syntax-only algorithms. But if you go with predictive-meaning methods (What word comes next: “As happy as the day is ____.”), the ratio would be even greater. Phenomenally slow compressing, but a higher ratio. And in that case it would only make my argument stronger.
The digits of Pi are not random. Not remotely random. A really good example of a non-random infinite sequence.
(Note: as a researcher in algorithmic information theory, my go-to definition of randomness is based on the size of code to generate the string, cf. Kolmogorov. Random strings have programs about the size of the strings. The shorter the generating program, the less random the string. Printing out a googol of 0’s doesn’t require much code.)
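A rough, computable stand-in for that idea is to use compressed size as an upper bound on description length; a quick sketch (the sizes here are arbitrary):

```python
import os, zlib

n = 1_000_000
zeros = b"0" * n              # highly regular: tiny generating program
noise = os.urandom(n)         # (pseudo-)random: no short description expected

for label, data in (("zeros", zeros), ("noise", noise)):
    c = zlib.compress(data, 9)
    print(f"{label}: {len(data):>9} bytes -> {len(c):>9} bytes compressed")
```

The megabyte of zeros shrinks to a sliver; the noise comes out slightly larger than it went in.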
You can describe the first x digits of Pi with a short program that generates Pi to x digits, plus x written in base 10. That’s a log-size description of x digits. Allow arrow notation or some such and you can get amazing compression.
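As a sketch of what that looks like in practice (assuming the mpmath library; the fixed generator plays the role of the decompressor, and the only part of the description that grows with x is x itself):

```python
from mpmath import mp

def pi_digits(x: int) -> str:
    """Fixed 'decompressor': expand pi to x decimal digits."""
    mp.dps = x + 2
    return str(mp.pi)[2:x + 2]

x = 10_000
expansion = pi_digits(x)
# The description is a constant-size program plus x itself,
# so it grows like log10(x), not like x.
print(f"{len(expansion)} digits reproduced from a description of "
      f"~{len(str(x))} digits (plus a fixed program)")
```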
Hmm… I bet you could do a whole lot better than PKZIP, which I believe just looks byte-by-byte for patterns, compressing repeated zeros very efficiently, but not accounting for the kind of varied but constrained patterns in English text. Two bytes, only enough for two to three letters, is enough to index a decent-sized dictionary, so even without anything more sophisticated we can get almost every word down to two bytes, with a few bits to spare for capitalization and such. I’m pretty sure that’s better than PKZIP would do on English text at one byte per letter.
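A toy sketch of that word-index idea, assuming a word list at /usr/share/dict/words; a real dictionary coder would handle case, punctuation, and out-of-vocabulary words far more carefully:

```python
import struct

# Toy word-index coder: map each word to a 16-bit dictionary index.
# Assumes a local word list; case and punctuation handling is glossed over.
with open("/usr/share/dict/words") as f:
    words = [w.strip().lower() for w in f][:65535]
index = {w: i for i, w in enumerate(words)}

def encode(text: str) -> bytes:
    out = bytearray()
    for w in text.lower().split():
        w = w.strip(".,;:!?\"'")
        if w in index:
            out += struct.pack(">H", index[w])       # two bytes per known word
        else:
            out += b"\xff\xff" + w.encode() + b"\0"  # crude escape for unknowns
    return bytes(out)

sample = "the quick brown fox jumps over the lazy dog"
print(len(sample), "bytes of text ->", len(encode(sample)), "bytes encoded")
```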
Wikipedia says it is NOT known if pi is normal. So it is possible pi does not contain every text.
It impresses me that the complete exact text of Gibbon’s Decline and Fall of the Roman Empire is in the digits of pi if you go out far enough. Before you get to it, of course, you’ll encounter zillions of versions with a misspelled word, or where Septimius never killed Emperor Didius Julianus. You’ll also find an exact copy of Decline and Fall of Western Civilization, published in 2361. Again there will be alternate versions, but now there’s no way to guess which is correct.
Indeed. Borges’s Library of Babel contains no information.
ftg, I think you’re misunderstanding the figure of compression programs being “88%”. I’m pretty sure that means that the size of the input decreases by 88%, in other words that the output is 12% the size of the input. Certainly even simple off-the-shelf programs do better than the output being 88% the size of the input: nobody would bother with them at all if that were the best they could do.
Obviously. The output of the Mersenne Twister isn’t random either. What I meant is that such pseudo-random sources will appear uniformly random to, e.g., a decompressor.
I wonder which compressor you’re speaking of. Most will have some redundancy due to constraints, but the only error correction I’m aware of in a method like MPEG is resynching after lost data.
Just now I tried a simple experiment (compressing the verses of the Douay-Rheims Bible!) and discovered that zip, gzip (based on Lempel-Ziv coding) and a simple implementation of Burrows-Wheeler compression ALL get almost exactly the same compression: 2.4 bits per character.
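For anyone who wants to reproduce something in this spirit, the measurement is roughly this (a sketch assuming a plain-text file named bible.txt, with Python’s bz2 standing in for the Burrows-Wheeler implementation):

```python
import bz2, gzip, zlib

# Report compressed size in bits per character of the input text.
# Assumes a plain-text file "bible.txt" in the current directory.
data = open("bible.txt", "rb").read()

compressors = {
    "zlib (Lempel-Ziv)": lambda d: zlib.compress(d, 9),
    "gzip (Lempel-Ziv)": lambda d: gzip.compress(d, 9),
    "bz2 (Burrows-Wheeler)": bz2.compress,
}
for name, comp in compressors.items():
    bits_per_char = 8 * len(comp(data)) / len(data)
    print(f"{name:>22}: {bits_per_char:.2f} bits/character")
```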
That 2.4-bits-per-character figure is better than the number Shannon gave in his early paper for simple machine methods, but not as good as he achieved using a human predictor in the compression loop! (But it’s hard to compare, because Shannon and others often assume a 27-character alphabet, while typical English text files distinguish upper and lower case, etc., and use about 80 characters.)
ETA: Lempel-Ziv text compression uses a stunningly simple but effective compression method. The idea behind Burrows-Wheeler compression is downright amazing!
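For the curious, here is the transform itself in its naive rotation-sorting form (real implementations use suffix arrays; the sentinel character is just a convenience):

```python
def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, take the last column."""
    s += "\0"                                   # sentinel marks the string's end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(repr(bwt("banana")))           # similar characters cluster together
print(repr(bwt("the theory then")))
```

The point of the transform is that the output groups similar characters into runs, which a simple follow-up coder can then squeeze very effectively.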
If you tune the model, you can get a lot better compression. A custom version of PAQ cuts 1 GB of English Wikipedia down to about 13% of its original length, and 100 MB down to 15% (note that this includes the size of the program…), which is basically what ftg was saying, and what you can achieve if you don’t mind ridiculously slow compression and a very special-purpose algorithm.
So far as I know, PKZip includes deliberate redundancy for purposes of error correction. There used to be a sister application called “PKZipFix” that would (sometimes) take a mangled .zip file and correct it so that it could be unzipped, and it’s hard to see how that would work without error-correction redundancy.
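At minimum, the format carries a per-entry CRC-32, which covers the detection half of the job; a quick sketch of it in action with Python’s zipfile module (building a small archive in memory, flipping one byte of the stored data, and re-checking):

```python
import io, zipfile

# Build a small archive in memory, flip one byte of the stored file data,
# and let the per-entry CRC-32 catch it on re-read.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("hello.txt", "Hello, world! " * 100)

raw = bytearray(buf.getvalue())
raw[100] ^= 0xFF                     # flip a byte inside the stored file data

with zipfile.ZipFile(io.BytesIO(bytes(raw))) as zf:
    print("first entry failing its CRC check:", zf.testzip())
```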
And MPEG is, I would think, a completely different topic, because it’s lossy.
MPEG does not provide an approximate encoding of an exact video. It provides an exact encoding of an approximation to a video.
I suppose that’s true, in the sense that different MPEG players will all produce the exact same decompression, and if that decompression happened to be the exact video you were starting with, then it would compress it losslessly. A fine distinction, to be sure, but then, that’s kind of what we’ve come to expect here.
Then again, by that standard, I don’t think there are any compression routines, for any medium, that produce “an approximate encoding of exact media”.
But strongly suspected to be; we’ll take it as a given for purposes of discussion.
This misconstruction almost seems snarky, or an accusation that mine was snark. In fact mine was a terse reminder that your final sentence in #52 was a non sequitur in context. (If this isn’t clear we should move the subtopic elsethread.)
I’ve published in the field. There was absolutely no misunderstanding. I took the number straight from charts online comparing compression methods. “88%” means it is 88% smaller. “0%” means no compression.
If I were so stupid as to not get this, then the comparison to a Shannon-like number would have been even stupider. In addition, I mentioned higher ratios as being greater compression.
And you thought I misunderstood this???
My apologies, ftg. Your estimates went from 1.3 bits per byte to 7 bits per byte, suggesting some radical change in the percentage you were using. I know you’ve worked a lot in information theory, but I thought it possible that you had less experience with English language compression specifically.
Also my apologies to septimus; I didn’t mean to imply that I regarded pedantry as a bad thing. We expect pedantry here, and that’s why I’ve stuck around for two decades, because I like pedantry.
Thanks for the apology, Chronos. Rereading my comment I see it was much too harsh, so let me apologize to you also.
I’ll add that I’m not a fan of pedantry (though I’m sure many of my posts come across as pedantic). My “MPEG does not provide an approximate encoding of an exact video. It provides an exact encoding of an approximation to a video” was intended as useful insight, not as pedantry.
With a simple experiment just now, using Firefox’s JPEG renderer, I see that my comment might have been misleading. I displayed a large JPEG image via Firefox, after first overwriting one of the JPEG file’s bytes with garbage. I did this only a handful of times, so the results are not significant, but I’ll report them anyway:
When I replace a byte arbitrarily in the middle of a JPEG file, sometimes the image completely disappears, sometimes there is a visual flaw over a large part of the image, and sometimes there is no effect discernible to my eye! New bytes of ‘00’ seem likely to be benign; ‘FF’ (denoting a marker) catastrophic. I assume the renderer often detects a failure and scans forward for the next resync marker.
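The same poke-a-byte experiment is easy to script (a sketch assuming the Pillow library and a local file named test.jpg of at least a few kilobytes; results will vary with the image and with which byte you hit):

```python
import io, random

from PIL import Image   # Pillow, assumed installed

# Corrupt one byte somewhere past the header region and see whether the
# decoder errors out or decodes anyway (possibly with visible flaws).
original = open("test.jpg", "rb").read()

for trial in range(5):
    data = bytearray(original)
    pos = random.randrange(1000, len(data) - 2)   # skip the header region
    data[pos] ^= 0xFF
    try:
        img = Image.open(io.BytesIO(bytes(data)))
        img.load()                                # force a full decode
        print(f"trial {trial}: byte {pos} changed, image still decodes")
    except Exception as e:
        print(f"trial {trial}: byte {pos} changed, decode failed: {e}")
```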
Huh, based on my understanding of how JPEG works, I would have thought that any defect introduced by a changed byte would be limited to a single 8x8 pixel square. I suppose that I might have learned an older version of the encoding, and it’s now more sophisticated.