Mathmen: Calculate the odds...(RE Bible Code)

OK, this is the final equation I have. I don’t remember all the fancy notation:

n = text size, in characters
t = target string size, in characters
s = skip spacing (a value of 1 means consecutive letters)

possibilities = n - t*s + s
which can also be written as: n - s(t-1)

probability of a string appearing in a text = probability of each letter in the string multiplied together / number of possible sequences to check

To find the denominator, we take the summation of n - s(t-1) for s from 1 to int((n-1)/(t-1)). (Or maybe you want to start with s=2, because finding a string of consecutive letters isn’t so exciting.)

In case you’re wondering where the initial “possibilities” equation comes from, basically, I’m figuring out what the width of our search space is, and then figuring out how many spaces from the end it is (how many times we can shift it and get a new sequence of numbers spaced a certain amount apart).
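A quick brute-force check of that reasoning (a sketch in Python; the function names are mine, not anything standard):

```python
def possibilities(n, t, s):
    # Start positions for a t-letter target with skip s in an n-letter text.
    # The target spans s*(t-1) + 1 characters, so there are n - s*(t-1)
    # places it can start.
    return n - s * (t - 1)

def total_sequences(n, t):
    # Denominator: sum over every skip that still fits,
    # s = 1 .. int((n-1)/(t-1))
    return sum(possibilities(n, t, s)
               for s in range(1, (n - 1) // (t - 1) + 1))

def brute_force(n, t):
    # Directly enumerate every (start, skip) pair that stays inside the text
    return sum(1
               for s in range(1, n)
               for start in range(n)
               if start + s * (t - 1) <= n - 1)

assert total_sequences(100, 4) == brute_force(100, 4)
```

The brute-force count and the closed-form sum agree, so the possibilities equation does look OK.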

Actually, I think I missed a one someplace in there, so my possibilities equation seems to be off. Lemme double check.

edit: No, it looks ok.

Solving the Bible Code Puzzle (.pdf)

Actually, come to think of it, there is a bit of an error there. I’m mixing up the terms probability and odds (so I have to rearrange the equation; the numerator in my equation above is the reciprocal of the probability), and I think this is one of those cases where I should be calculating the probability of not choosing the correct letters, then subtracting that from one. Otherwise, it seems with my equation you can get a probability greater than 1, which is incorrect.

Whatever formula you get will have been derived on the assumption that the text is random. You can’t turn around and apply it to non-random text and draw any meaningful conclusions.

So even using a frequency analysis of letter probabilities of a certain text isn’t going to give anything approaching a meaningful answer? I can understand that with short spacings between letter pairs, the probabilities may be skewed (because of letter pairs and triplets and the sort), but once you get to a wide enough spacing, shouldn’t the probabilities pretty much follow the frequency analysis? In other words, say I have a text with 100,000 characters and it yields a frequency chart exactly mirroring the frequencies in English (like the chart I linked to earlier). If I picked every 139th character, wouldn’t a frequency analysis of my chosen letters be pretty damned close to the frequencies of the 100,000-character text?

No, because you need frequency information about all two letter sequences, and all three letter sequences, and so on and so forth. As I alluded to above, natural languages are so highly structured that the only thing they really have in common with random texts is that both are made up of letters.

Hmmm…that seems counterintuitive to me. I’ll have to code it up and see for myself. I expect the greater the spacing, the more the selected text approaches the overall frequency of the total text. I don’t see why the structure of the language should make much of a difference – your final frequency analysis should have that taken into account.

If you test it and it does turn out that the random letter model is a reasonable approximation, then that’s fine (and given sufficiently large spacing, it may well be–my last claim may have been a bit strong for very large gaps). The point I originally wanted to make is that you can’t go around just assuming that whatever formulas you’ve derived from a very particular model will apply to any situation that’s superficially similar.

The way that you choose letters in the Bible codes means that adjacent letters in the output text come from positions in the original text separated by (the width of the columns × the skip distance) plus or minus the skip distance. Unless you have very stylized poetry that depends on the number of letters in the words, I don’t see how the letter pairs in words are going to line up very well with the Bible code interleaving. So I agree with you, Pulykamell: the way letters are chosen randomizes them pretty well.

Refuting that with good mathematics is what the first several replies were addressing; just not the specific good math that you were thinking of. The fact that you also have to consider other equally-significant outcomes is one of the cornerstones of statistics.

Well, I just tried it in Python on Moby Dick. Frequency analysis for the entire text looks like this:



Characters sorted by frequency:
 e  0.122940
 t  0.092541
 a  0.081592
 o  0.073059
 n  0.068895
 i  0.068809
 s  0.067182
 h  0.065561
 r  0.055122
 l  0.044742

TOTAL CHARACTERS = 967698


With skip = 5 we have:



 e  0.124092
 t  0.091615
 a  0.082123
 o  0.073163
 i  0.069201
 n  0.069071
 s  0.066758
 h  0.064378
 r  0.054934
 l  0.044760


With skip = 109



 e  0.125384
 t  0.087871
 a  0.080255
 o  0.073548
 i  0.071388
 n  0.070592
 s  0.065477
 h  0.065250
 r  0.052745
 l  0.041832


Even picking every second letter, we get:



 e  0.123340
 t  0.092886
 a  0.081668
 o  0.072961
 i  0.068877
 n  0.068869
 s  0.066820
 h  0.065254
 r  0.055085
 l  0.044929


So, reasonably similar frequencies.
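For reference, here’s the sort of script that could produce tables like those (a sketch, not my exact code; the filename is made up):

```python
from collections import Counter

def letter_frequencies(text, skip=1):
    # Strip everything but letters, lowercase, then sample every skip-th one
    letters = [c for c in text.lower() if c.isalpha()]
    sampled = letters[::skip]
    total = len(sampled)
    counts = Counter(sampled)
    return {c: n / total for c, n in counts.items()}, total

# Usage (the path is hypothetical):
# text = open("mobydick.txt").read()
# freqs, total = letter_frequencies(text, skip=5)
# for c, f in sorted(freqs.items(), key=lambda kv: -kv[1])[:10]:
#     print(f" {c}  {f:.6f}")
```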

I guess it doesn’t matter that much, but does this mean I’m misunderstanding the Bible code? Is it: pick every Xth letter, then line the text up and see what interesting words are around it (which is the assumption I’ve been working with)? Or is it something a little different? I may have been misunderstanding how it works, or maybe not.

Ok, so:

First, let’s assume the Latin alphabet.
Second, if all 26 letters of the alphabet were equally frequent, then there would be a probability of 1/26 ≈ 3.8% of finding identical letters at two random positions.
Given the actual frequencies, the probability is fA×fA + fB×fB + fC×fC (etc.), where fA, fB, fC (etc.) are the frequencies of A, B, C (etc.) in English. With the frequencies posted by pulykamell, we obtain 6.5%, i.e. 1 chance out of 15 rather than 1 out of 26. That makes quite a difference!
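That 6.5% match probability can be checked against a published English frequency table. The numbers below are a commonly cited approximate table, not the ones from this thread:

```python
# Approximate English letter frequencies (a commonly cited table)
FREQ = {
    "e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070,
    "n": 0.067, "s": 0.063, "h": 0.061, "r": 0.060, "d": 0.043,
    "l": 0.040, "c": 0.028, "u": 0.028, "m": 0.024, "w": 0.024,
    "f": 0.022, "g": 0.020, "y": 0.020, "p": 0.019, "b": 0.015,
    "v": 0.010, "k": 0.008, "j": 0.002, "x": 0.002, "q": 0.001,
    "z": 0.001,
}

# Probability two random English letters are identical:
# the sum of the squared frequencies
p_match = sum(f * f for f in FREQ.values())
# compare with the uniform-alphabet figure of 1/26 ≈ 3.8%
```

With this table p_match comes out around 0.065, i.e. the 6.5% used below.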
Now let us go for our crossing computations. Roughly, we start from a word which occurs somewhere in a text. Then you look for another word which is made of letters appearing at equally spaced locations in the text, one of which lies inside the initial word (hope that this is clear enough). Typically, “equally spaced” means a spacing of between 11 and 30 characters.

For illustration, assume that the initial word is ‘spaghetti’ and the second ‘tomato’. Basically we are going to:
(1) consider all series of 6 equally spaced locations which intersect ‘spaghetti’.
(2) check whether ‘tomato’ occurs in each of these series.
(3) check that at the point of intersection, the same letter occurs in ‘spaghetti’ and ‘tomato’ (i.e. even if ‘tomato’ occurs in a certain series of positions, this occurrence cannot be valid if the series intersects the second letter of ‘spaghetti’, since ‘p’ does not occur in ‘tomato’).

Sooo,
(1) This is easy: start from each letter in ‘spaghetti’ (there are 9). Each letter of ‘spaghetti’ may intersect the first letter of ‘tomato’, or the second, or the third, etc. (6 possibilities). You allow spacings between 11 and 30 characters: this gives you 9 × 6 × 20 = 1080 possibilities.
(2) Easy too. The letters at these locations are random and independent (since they are taken from non-consecutive locations of an English text), with the frequencies as above. So in brief the probability of a match is (6.5%)^6.
(3) Easy, this is just one extra match between two random letters, so the probability is 6.5%.

So the probability of an occurrence at a given location is 6.5%^7. The probability of at least one occurrence at one of the possible positions is:

1 - (1 - 6.5%^7)^(9 × 6 × 20)

Yes, this is nasty to compute. You can simplify things by computing the average number of times that ‘tomato’ crosses ‘spaghetti’ (this is not the same as the probability that it crosses at least once, but it will be very close, and it is still relevant to compute); it is:

(6.5%^7) × 9 × 6 × 20

Which is equal to about 1/188,000. Small, but not so small (and I took a rather long word by choosing ‘tomato’). Finally, assume that you have a list of 100 ingredients or recipes that can be associated with spaghetti; then you get an average of 100/188,000 = 1/1,880 (we are getting near). Take into account the fact that ‘spaghetti’ occurs 10 times in your cooking book and that you could reiterate the computation with ‘banana’, ‘hamburger’, etc. instead of ‘spaghetti’, and you are sure that the crossing will occur.
So, our general formula is:
p = 1 - (1 - 6.5%^(m+1))^(n × m × k)   or   a = (6.5%^(m+1)) × (n × m × k)

with:
n = number of letters in the original word
m = number of letters in the second word
k = number of possible different spacings

p = probability of the crossing occurring
a = average number of occurrences (in practice close to p and easier to compute)
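Putting that formula into code (a sketch; 6.5% is the letter-match probability derived above, and the function name is mine):

```python
P_MATCH = 0.065  # chance two random English letters are identical

def crossing_stats(n, m, k, p=P_MATCH):
    # n: letters in the word already found
    # m: letters in the word we search for
    # k: number of allowed spacings (e.g. 11..30 gives k = 20)
    positions = n * m * k                    # candidate equally spaced series
    p_single = p ** (m + 1)                  # m letter matches plus the intersection
    a = p_single * positions                 # average number of crossings
    prob = 1 - (1 - p_single) ** positions   # P(at least one crossing)
    return a, prob

# 'spaghetti' (9 letters) crossed by 'tomato' (6 letters), spacings 11..30:
a, prob = crossing_stats(n=9, m=6, k=20)  # a is about 1/189,000
```

For numbers this small, a and p are nearly identical, which is why the average is the easier quantity to work with.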

This formula is valid for just one string crossing another at a given location where one of the strings already appears, as in our example. Notice that there can be variants of this formula if you define different kinds of crossings (here the two words actually intersect; alternatively, it could be that ‘tomato’ points towards ‘spaghetti’ without actually crossing it. This gives you a lot more possible locations and removes constraint number 3).

If you want to compute the probability of a 3-word crossing (e.g. ‘tomato’ crosses ‘spaghetti’, then ‘sauce’ crosses ‘tomato’ OR ‘spaghetti’), then assume that ‘tomato’ crosses ‘spaghetti’ (with the probability as above), then look for the probability that ‘sauce’ crosses the composite word ‘tomatospaghetti’. This second occurrence appears on average 6.5%^6 × (15 × 5 × 20) times, i.e. it has about 1 chance in 9,000 of occurring. The whole 3-word crossing therefore has a probability of about 1/(188,000 × 9,000). Sure, this is not a lot, but remember that I used long words (5 and 6 letters).

I want to find “enni” in the bible.
I will look at the text (skip 1), at skip 2, at skip 3, etc., until the skip gets too big.
If there are B letters in the bible, then the number of starts I am using is approximately
S = B + B/2 + B/3 + B/4 + … (S is a REALLY BIG number).

I’ll assume independence between consecutive letters in the ELS candidates - reasonable for skip 100, but very questionable for skip 1 (consecutive letters). Using Pulykamell’s frequencies, I’ll find “enni” at a given start with probability
p = 0.00004134728

The probability that I do NOT find “enni” at any of the S starts is (1-p) raised to the power S. This probability is just about zero for a large number S of starts.
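Following that recipe through in Python (a sketch; the frequencies come from the whole-text Moby Dick table earlier, and S uses the post’s harmonic approximation, so the exact values of p and S are only approximate):

```python
import math

# Letter frequencies from the whole-text Moby Dick table above
FREQ = {"e": 0.122940, "n": 0.068895, "i": 0.068809}

target = "enni"
# Independence assumption: multiply the single-letter frequencies
p = math.prod(FREQ[c] for c in target)

B = 967_698  # total characters, from the Moby Dick run above
# Largest skip that still fits a 4-letter target, then the post's
# approximation S = B + B/2 + B/3 + ...
max_skip = B // (len(target) - 1)
S = sum(B // s for s in range(1, max_skip + 1))

p_none = (1 - p) ** S  # probability of NOT finding "enni" at any start
```

p_none comes out vanishingly small: with that many starts, failing to find a four-letter string anywhere is essentially impossible.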

I have to go make waffles for Boroka.

You’re understanding the technique bible codists use all right. Once the first phrase or word is found, they look for ANYTHING that seems to have an association nearby. Which to them is proof that they were placed there deliberately by some supernatural being.

If I can show that what has been found is seemingly governed by the laws of chance, I think it pokes holes in the supernatural theory.

Lots of people have been doing this sort of thing, looking into secular books. Here, for instance, are various ‘predictions’ found in Moby Dick.

Some more technical refutations of the Bible code here.

Thanks, but I’m well aware of the refutations (although that reference is too dense and too Hebrew for me) and the Moby Dick link was given earlier in this thread. War and Peace has also been used.

But my angle seems a bit different. If I can show that, given a sufficiently large book, any short phrase has the same chance of being found via ELS in the Bible as in any other book of the same length (and that proves to be the case when tested), then the Bible or Torah doesn’t seem so special after all. Unless Drosnin wants to argue that ALL books have codes inserted in them by some god.

The Moby Dick type stuff is the best refutation. It’s a clear, applied demonstration of whatever mathematical result you are looking for. That having been said, as for the mathematical result you seek, although it’s not exactly clear what you want, here’s the basic one which seems to be what you’re asking for:

For any phrase, a sufficiently long book of random text approaches probability 1 of containing it (directly, even!). This is the “infinite monkey theorem”.

(Specifically, when I say “random” text, the assumption being made is that one is using a probability distribution where the letters at different positions are independently and identically distributed, and such that at any position, any particular letter has non-zero probability of occurring)
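Under that i.i.d. model the claim is easy to make concrete (a sketch; the uniform 1/26 letter probability is an illustrative assumption, and treating overlapping start positions as independent is an approximation):

```python
def prob_contains(n, t, p_letter):
    # Probability a t-letter phrase appears starting at one given position
    # is p_letter**t under the i.i.d. model; treating the n - t + 1 starts
    # as roughly independent, the chance of at least one hit is:
    p_start = p_letter ** t
    return 1 - (1 - p_start) ** (n - t + 1)

# A 5-letter phrase in uniform random text over a 26-letter alphabet:
short = prob_contains(10**6, 5, 1 / 26)  # noticeable but small
huge = prob_contains(10**9, 5, 1 / 26)   # essentially certain
```

As n grows, the probability climbs toward 1 for any fixed phrase, which is the infinite monkey theorem in miniature.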

When the Bible Codists reach the end of the text of the Bible (or whatever else they’re using), do they stop the search or loop back around to the beginning and continue? For a text of n letters, this ‘modulo’ technique would allow for skips of greater than n and would basically extend the searchable text to infinity; albeit one that repeats every n letters.