Exactly my first thought. I forget what message was found, but when I don my thinking beanie (with iridescent propeller) I wonder why anyone would hope to find messages in contemporary English or any human language, modern or ancient. I’d expect a message from The Creator(s) Of This Universe to enclose patterns of high geometry, possibly with blueprints of spacetime twisters. Decode Pi and reality is ours! Can we get a grant?
I thought he was just poking fun.
I’m sure it was in fun but I’m pretty sure those are real tools.
I mean obviously I googled what a trigraph is, and I understand that, but I don’t understand why you’d have volumes of them sitting around at your fingertips, what you’d use them for, or why one volume would be better than another volume.
This sounds vaguely like a thing I should maybe learn and use for a project I’m doing, but I wanted to ask septimus what he’s doing to get a better feel for it.
What about number sequences in Pi? Can one find 0123456789 somewhere in the first ten million digits?
A trigraph is pretty much what the name says - a triple of letters. Using them to compress text means that you break the text up into three-letter groups and then encode/compress using codes that denote the trigraphs, not individual letters. Given that most language doesn’t include useless trigraphs like ‘aaa’ or ‘qpz’, you can compress the text. You could sort the trigraphs by frequency and use Huffman encoding or similar, but there are many variants possible. The trick is that your coding depends upon the text. So if you use a coding based upon a given book, you would expect random decoding to mostly give you back what is, for the most part, a random garbling of that book. No surprise it looks sort of recognisable in places.
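To make that concrete, here is a minimal sketch in Python (not necessarily how it was actually done): count non-overlapping trigraphs in a reference text and build a Huffman code over them, so the common trigraphs get short bit strings. The file name is made up, and details like leftover characters and unseen trigraphs are glossed over.

[code]
# Sketch: trigraph frequencies from a reference text, plus a Huffman code
# so frequent trigraphs get short bit strings. Illustrative only.
import heapq
from collections import Counter

def trigraph_counts(text):
    """Count non-overlapping trigraphs (spaces count as characters)."""
    return Counter(text[i:i+3] for i in range(0, len(text) - 2, 3))

def huffman_code(counts):
    """Return {trigraph: bitstring}, shorter strings for commoner trigraphs."""
    heap = [[n, i, {t: ""}] for i, (t, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in lo[2].items()}
        merged.update({t: "1" + c for t, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2]

reference = open("kjv.txt", encoding="utf-8").read().lower()  # hypothetical file
code = huffman_code(trigraph_counts(reference))

def compress(text, code):
    """Concatenate the bit strings for each trigraph (unseen trigraphs skipped)."""
    return "".join(code.get(text[i:i+3], "") for i in range(0, len(text) - 2, 3))
[/code]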
The King James stuff is misleading, since (I thought everybody knew this?) God speaks Late Latin, not English. So you would take your Biblia Vulgata and count all three-character sequences, so you know their relative frequencies. Then, given a number (let’s say between 0 and 1), arithmetic decoding with respect to this model will produce a sequence of letters+space conforming to those statistics. The trick is to get the details of the encoding exactly right, otherwise you will see gibberish instead of the secret message, and be doomed to an endless cycle of lives until you finally figure it out.
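For the decoding direction, here is a much-simplified sketch of the same idea. It is not true arithmetic coding, and not necessarily what was actually done: train "next character given the previous two" counts on a reference text, then let each digit of the input number pick which conditional-probability bracket to fall into. The file name is hypothetical.

[code]
# Sketch: generate text whose trigraph statistics mimic a reference text,
# steering each choice with successive digits of a number such as pi.
from collections import Counter, defaultdict

def train(text):
    """model[digraph] = Counter of the characters that follow that digraph."""
    model = defaultdict(Counter)
    for i in range(len(text) - 2):
        model[text[i:i+2]][text[i+2]] += 1
    return model

def pick(counts, u):
    """Return the character whose cumulative-probability bracket contains u."""
    total = sum(counts.values())
    acc = 0.0
    ch = " "
    for ch, n in counts.items():
        acc += n / total
        if u < acc:
            break
    return ch

def decode(model, digits, seed="th"):
    out = seed
    for d in digits:
        counts = model.get(out[-2:]) or Counter(" ")  # fall back if context unseen
        out += pick(counts, d / 10.0)                 # one digit steers one choice
    return out

vulgate = open("vulgata.txt", encoding="utf-8").read().lower()  # hypothetical file
pi_digits = [int(c) for c in "1415926535897932384626433832795"]
print(decode(train(vulgate), pi_digits))
[/code]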
Not every word is composed of trigraphs, though; some are (making up some terms) digraphs or monographs or tetragraphs, right? Like I think French might even have some pentagraphs. So the algorithm must be taking this into account somehow.
Takes closer to 20 billion digits. That sequence shows up at position 17,387,594,880.
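If you want to check claims like that yourself, it is just a substring search over a plain-text file of digits (several such files are downloadable). The sketch below scans in chunks so a multi-gigabyte file doesn't have to fit in memory; the file name is made up, and you would still have to settle the usual convention questions (0- vs 1-based position, whether the leading '3' counts, stripping any newlines).

[code]
# Sketch: find a digit string in a large plain-text file of pi digits.
# Assumes the file is one unbroken run of digits (strip newlines first if not).
def find_in_digit_file(path, pattern="0123456789", chunk=10**7):
    pos = 0
    tail = ""
    with open(path) as f:
        while True:
            block = f.read(chunk)
            if not block:
                return -1
            window = tail + block                  # keep overlap across chunks
            hit = window.find(pattern)
            if hit != -1:
                return pos - len(tail) + hit       # 0-based position in the file
            tail = window[-(len(pattern) - 1):]
            pos += len(block)

# e.g. find_in_digit_file("pi_digits.txt")  # hypothetical file name
[/code]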
We should count “space” as a character, so, for example, your post could be chopped up as
Not| ev|ery| wo|rd |is |com|pos|ed |o…
I don’t know if that’s what he actually did, though.
E.g., maybe it would be more interesting to consider trigraphs but add one letter at a time:
Not|ot |t e|…
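For what it's worth, both chopping styles are one-liners in Python:

[code]
# The two chopping styles mentioned above, side by side.
text = "Not every word is composed of trigraphs"

blocks  = [text[i:i+3] for i in range(0, len(text) - 2, 3)]  # fixed 3-letter blocks
sliding = [text[i:i+3] for i in range(len(text) - 2)]        # advance one letter at a time

print("|".join(blocks))   # Not| ev|ery| wo|rd |is |com|...
print("|".join(sliding))  # Not|ot |t e| ev|eve|ver|...
[/code]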
In the novel version of Contact, the message Ellie Arroway found buried deep in pi was actually simple low geometry, just an image of a circle, encoded in base 11. Just enough to prove the universe had a creator.
I didn’t buy the bit that claimed the message was easy to notice in bases other than the intended one. I think that any message would look like random noise in any base that isn’t a multiple of 11, so the search would have to specifically compute different bases and look for a non-normal distribution of digits.
Well, not only that, but it only makes a circle when the digits are put in an array of a specific width. Use a different width and it would either be distorted or not recognizable.
So clearly, the corollary question is: how far into pi do we find a reasonably sized circle graphic? Coding can be anything easily expressed. Maybe a simple bitmap of binary coding would do. Row width can obviously be arbitrary (and no doubt part of the deep secret).
We can easily estimate how far in we might expect to find any given sequence, but how far it actually is is another matter. 31 trillion digits have an even chance of holding an arbitrary pattern of about 45 bits. That isn’t a very big graphic: a pattern about 7 pixels on a side. But that would be enough to have some fun with. We have additional wiggle room by choosing the row width. Gets us a few more bits of pattern, but not many if we stick to sane widths.
Be the first to find your chosen symbol in pi.
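A quick sanity check on that estimate, treating pi's digits as if they were fair random bits (an assumption, not a theorem): a specific k-bit pattern has about an even chance of appearing once the number of available bit positions reaches roughly 2^k times ln 2. Plugging in 31 trillion digits lands in the same ballpark as the figure above.

[code]
# Back-of-the-envelope: how long a bit pattern has ~even odds of appearing
# somewhere in 31 trillion decimal digits, if those digits act like fair bits.
import math

digits = 31e12                      # ~31 trillion decimal digits
bits = digits * math.log2(10)       # ~1.0e14 bit positions
k = math.log2(bits / math.log(2))   # pattern length with ~even odds
print(f"{bits:.2e} bits -> even chance for a pattern of ~{k:.0f} bits")
[/code]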
IIRC, the pattern Arroway found was a sequence of nothing but 1s and 0s, with a length equal to the square of a prime number, so there was only one way to put them into a nontrivial grid. And it wasn’t obvious in other bases: She was just searching in multiple bases, including base eleven.
You made me go look up the text of the novel. I’d forgotten she did indeed tell the computer to compute other bases, but then the part that stuck with me was a few pages later:
That implied to me the claim that something odd would be noticed in other bases besides 11.
Any scheme to derive letters from pi is going to be somewhat arbitrary. Instead of base-26, use base-10 treating ‘01’ - ‘26’ as ‘A’ to ‘Z’? Arbitrary. What about ‘27’ - ‘52’ BTW? Do these wrap around for another copy of ‘A’ to ‘Z’? Would that be cheating? Why not use a UTF encoding that would find text in various languages in the digits of pi?
With all this arbitrariness, it seemed reasonable to also use some arbitrary compaction code. I went with simple trigraph statistics. Like the schemes proposed by others, this choice was arbitrary.
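For illustration, here is what the ‘01’-‘26’ scheme from the previous paragraph looks like in a few lines of Python, skipping out-of-range pairs (whether to skip, wrap, or re-map them is itself one of those arbitrary choices):

[code]
# Sketch of one arbitrary letter-mapping: read decimal digits two at a time
# and map '01'-'26' to 'A'-'Z', dropping pairs outside that range.
def pairs_to_letters(digit_string):
    out = []
    for i in range(0, len(digit_string) - 1, 2):
        n = int(digit_string[i:i+2])
        if 1 <= n <= 26:
            out.append(chr(ord('A') + n - 1))
    return "".join(out)

print(pairs_to_letters("141592653589793238462643383279"))
[/code]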
I’ve color-coded three specific questions to be answered.
(1) In the King James Bible, among all the occurrences of ‘th’, that digraph is followed by ‘e’ 105,650 times, by a space 19,753 times, by the letter ‘a’ 15,488 times, and so on, all the way down to ‘g’ 2 times (both in the name ‘Ramothgilead’). Such a table gives the probability of a digraph turning into each specific trigraph. One way to mimic the Bible is to make the trigraph statistics of the random output text match those of the Bible. (A minimal counting sketch appears below, after item 3.)
Sure, I’d get even better results using quadrigraph statistics. But my laptop’s memory isn’t even big enough to run Firefox properly — What are you, a sadist?
(2) Of course works other than the Bible could have been used to train the trigraph statistics, as I demonstrated by doing the same experiment with Shakespeare’s Complete Works. Which dataset do you want me to use? The Unabomber Manifesto?
(3) There are many out-of-copyright books that can be downloaded quite easily. I much prefer to read print, but the machine-text has advantages. For example, did you know that the word ‘happiness’ appears twice in the Douay-Rheims Bible (not even counting the notes, nor the unfortunate ‘unhappiness’ in Psalms 13:3) but nowhere at all in the King James Bible? :eek: One ‘happiness’ is in 2 Maccabees, which King James didn’t bother to translate, but the other is right there in Genesis 30 verse 13:
D-R: And Lia said: This is for my happiness: for women will call me blessed. Therefore she called him Aser.
K.J: And Leah said, Happy am I, for the daughters will call me blessed: and she called his name Asher
And what about all the pressing questions which arise here at SDMB? An important question in a recent thread was “How often do the other Gospels mention that ‘John’ was beloved?” I solved this easily with my searchable text. (Spoiler alert: Zero) (IIRC, I’ve even posted puzzles in the Game Room here, derived from downloaded texts and simple statistics tools. Unlikely to happen again given the reception my posts get lately. :o )
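Here is the minimal counting sketch for the digraph-to-trigraph table described in item (1). The file name is hypothetical, and the exact counts will depend on which plain-text edition you feed it and how you normalise case and punctuation.

[code]
# Sketch: for each digraph, tally which characters follow it, then report
# the conditional probabilities (here, just the top continuations of 'th').
from collections import Counter, defaultdict

text = open("kjv.txt", encoding="utf-8").read().lower()  # hypothetical file

follow = defaultdict(Counter)
for i in range(len(text) - 2):
    follow[text[i:i+2]][text[i+2]] += 1

th = follow["th"]
total = sum(th.values())
for ch, n in th.most_common(5):
    print(repr(ch), n, f"{n / total:.3f}")
[/code]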
Of course that’s what I did,
Is Mein Kampf downloadable? Perhaps the secret message in pi is in German.
I’m having a hard time finding hard numbers about the entropy measure of English (for example). There’s good old Shannon and some updated numbers on entropy of words and a lot of “theory” regarding sentences, but I haven’t come across some good numbers yet.
So, going with a Shannon-like measure of 1.3 bits/letter, a 20-letter sentence encoded at 1 byte/letter is 20*8 = 160 bits long but carries only about 26 bits of real information. That means there are about 134 bits of “junk”. But that “junk” can’t be too far off or the sentence will make no sense (let alone be understandable as roughly the original sentence).
Now, there’s a lot of comprehensible 20 letter sentences. But compared to something that might be like 2[sup]134[/sup] or so, it’s chicken feed. That’s over 10[sup]40[/sup]. Which is just a wee past trillions of digits.
So:
There might be a lot of interesting information/sentences/whatever in Pi.
It’s overwhelmingly swamped by junk.
You can’t tell which is which. (Cf. Kolmogorov’s algorithmic information theory.)
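Spelling out the arithmetic behind that argument (1.3 bits/letter is just one commonly quoted Shannon-style figure, not gospel):

[code]
# The numbers above, spelled out.
import math

letters = 20
raw_bits = letters * 8             # 160 bits at one byte per letter
info_bits = letters * 1.3          # ~26 bits of real information
junk_bits = raw_bits - info_bits   # ~134 bits of redundancy

print(raw_bits, round(info_bits), round(junk_bits))
print(f"2^{junk_bits:.0f} is about 10^{junk_bits * math.log10(2):.0f}")  # ~10^40
[/code]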
One simple measure of the entropy content of English would be how much the standard compression algorithms like PKZip can compress plain text. It’d only be an upper bound, since you can’t be certain that there is no better compression algorithm, but I’m guessing it’s a pretty close bound.
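That measurement is easy to run: compress a plain-text file with zlib (the same DEFLATE family that PKZip-style tools use) and report the implied bits per character. The file name is hypothetical, and as noted the result is only an upper bound on the entropy of the text.

[code]
# Sketch: estimate an upper bound on bits/character of a text via DEFLATE.
import zlib

raw = open("kjv.txt", "rb").read()       # hypothetical plain-text file
packed = zlib.compress(raw, level=9)
print(f"{8 * len(packed) / len(raw):.2f} bits/char "
      f"({len(raw)} -> {len(packed)} bytes)")
[/code]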
The very best compressors (for files of some type G) have the property that, when presented with a “compressed file” which is actually just random noise (perhaps the bits of pi), they will produce an output (“decompressed file”) which closely resembles the desired target file-type G.
My trigraph compressor/decompressor obviously didn’t do real well — the result doesn’t look like English text.
But what is the target G? In the 1990s, image compression contests used a set of 16 standard test images. The joke was that the best compressor stored only 4 bits: a number from 0 to 15 indicating which standard image was compressed!