Ok, so:
First, let’s assume the Latin alphabet.
Second, if all 26 letters of the alphabet were equally frequent, the probability of finding identical letters at two random positions would be 26 * 1/(26*26) = 1/26, i.e. about 3.8%.
Given the actual frequencies, the probability is fA*fA + fB*fB + fC*fC (etc.), where fA, fB, fC (etc.) are the frequencies of A, B, C (etc.) in English. With the frequencies posted by pulykamell, we obtain 6.5%, i.e. 1 chance out of 15 rather than 1 out of 26. That makes quite a difference!
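To make the fA*fA + fB*fB + ... sum concrete, here is a minimal sketch. The frequency table is an assumption: these are commonly quoted approximate values for English, not necessarily the ones pulykamell posted, and `ENGLISH_FREQ` is just a name picked for the illustration.

```python
# Approximate English letter frequencies in percent (assumed, commonly
# quoted values; not taken from the post being discussed).
ENGLISH_FREQ = {
    "e": 12.702, "t": 9.056, "a": 8.167, "o": 7.507, "i": 6.966,
    "n": 6.749, "s": 6.327, "h": 6.094, "r": 5.987, "d": 4.253,
    "l": 4.025, "c": 2.782, "u": 2.758, "m": 2.406, "w": 2.360,
    "f": 2.228, "g": 2.015, "y": 1.974, "p": 1.929, "b": 1.492,
    "v": 0.978, "k": 0.772, "j": 0.153, "x": 0.150, "q": 0.095,
    "z": 0.074,
}

# Normalise so the frequencies sum to exactly 1.
total = sum(ENGLISH_FREQ.values())
freqs = {letter: pct / total for letter, pct in ENGLISH_FREQ.items()}

# Probability that two independently drawn letters are identical:
# the sum over all letters of (frequency squared).
p_match = sum(f * f for f in freqs.values())

print(f"match probability: {p_match:.4f} (about 1 in {1 / p_match:.1f})")
```

With this table the result lands close to the 6.5% used in the rest of the computation.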
Now let us go for our crossing computations. Roughly, we start from a word which occurs somewhere in a text. Then we look for another word made of letters appearing at equally spaced locations in the text, one of these locations being inside the initial word (hope that this is clear enough). Typically, “equally spaced locations” means a spacing between 11 and 30 characters.
For illustration, assume that the initial word is ‘spaghetti’ and the second ‘tomato’. Basically we are going to:
(1) consider all series of 6 equally spaced locations which intersect ‘spaghetti’.
(2) check if ‘tomato’ occurs in each of these series.
(3) check that, at the point of intersection, the same letter occurs in ‘spaghetti’ and ‘tomato’ (i.e. even if ‘tomato’ occurs in a certain series of positions, this occurrence cannot be valid if the series intersects the second letter of ‘spaghetti’, since ‘p’ does not occur in ‘tomato’).
Sooo,
(1) This is easy: start from each letter in ‘spaghetti’ (there are 9). Each letter of ‘spaghetti’ may intersect the first letter of ‘tomato’, or the second, or the third, etc. (6 possibilities). You allow spacings between 11 and 30 characters (20 possibilities). This gives you 6*9*20 = 1080 possibilities.
(2) Easy too. The letters at these locations are random and independent (since they are taken from non-consecutive locations of an English text), with the frequencies as above. So in brief the probability of a match is (6.5%)^6.
(3) Easy, this is just one extra match between two random letters, so the probability is 6.5%.
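Steps (1) to (3) can be sketched as a brute-force search. This is only an illustration under assumptions: the function name `find_crossings`, its parameters, and the toy text are made up for the example, and only forward (left-to-right) readings of the second word are considered.

```python
def find_crossings(text, start, first, second, spacings=range(11, 31)):
    """Return (i, j, d) triples where `second` crosses `first`.

    i: index of the shared letter in `first`
    j: index of the shared letter in `second`
    d: spacing between consecutive letters of `second`
    """
    assert text[start:start + len(first)] == first
    hits = []
    for i in range(len(first)):        # letter of `first` used as crossing point
        for j in range(len(second)):   # letter of `second` placed on it
            for d in spacings:         # candidate spacing
                origin = start + i - j * d               # index of second[0]
                last = origin + (len(second) - 1) * d    # index of second[-1]
                if origin < 0 or last >= len(text):
                    continue
                # Steps (2) and (3) at once: every letter of `second` must
                # match the text, including the one lying inside `first`.
                if all(text[origin + t * d] == second[t]
                       for t in range(len(second))):
                    hits.append((i, j, d))
    return hits


# Toy text: filler 'x', 'spaghetti' at index 100, and 'tomato' planted
# with spacing 11 so that its 'a' lands on the 'a' of 'spaghetti'.
chars = list("x" * 300)
start = 100
chars[start:start + 9] = "spaghetti"
for t, letter in enumerate("tomato"):
    chars[start + 2 + (t - 3) * 11] = letter
text = "".join(chars)

print(find_crossings(text, start, "spaghetti", "tomato"))
```

The three nested loops visit exactly the 9 * 6 * 20 candidate placements counted in step (1); on the toy text only the planted crossing is reported.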
So the probability of occurrence at a given location is 6.5%^7. The probability of at least one occurrence among the possible positions is:
1 - (1 - 6.5%^7)^(6*9*20)
Yes, this is nasty to compute. You can simplify things by computing the average number of times where ‘tomato’ crosses ‘spaghetti’ (this is not the same as the probability that it crosses at least one time, but it will be very close and it is still relevant to compute), it is :
(6.5%^7)*6*9*20
Which is equal to about 1/188000. Small, but not so small (and I took a rather long word by choosing ‘tomato’). Finally, assume that you have a list of 100 ingredients or recipes that can be associated with spaghetti, then you get an average of 100/188000 = 1/1880 (we are getting near). Take into account the fact that ‘spaghetti’ occurs 10 times in your cooking book and that you could repeat the computation with ‘banana’, ‘hamburger’ etc. instead of ‘spaghetti’, and some crossing becomes very likely to occur.
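For anyone who wants to check the numbers, here is a small sketch comparing the exact probability with the average count for the ‘spaghetti’/‘tomato’ example. It takes the 6.5% figure as given; the variable names are only for illustration.

```python
import math

q = 0.065                 # chance that two random English letters match
positions = 6 * 9 * 20    # letters of 'tomato' * letters of 'spaghetti' * spacings
per_position = q ** 7     # six letters of 'tomato' plus the crossing letter

# Exact probability of at least one crossing: 1 - (1 - x)^N.
# log1p/expm1 keep precision when x is tiny compared to 1.
p = -math.expm1(positions * math.log1p(-per_position))

# Average number of crossings: x * N.
a = per_position * positions

print(f"p = {p:.6e}")
print(f"a = {a:.6e}  (about 1 in {1 / a:,.0f})")
print(f"with 100 candidate words: about 1 in {1 / (100 * a):,.0f}")
```

The two values agree to several significant digits, which is why the average is a convenient stand-in for the probability here.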
So, our general formula is:
p = 1 - (1 - 6.5%^(m+1))^(n*m*k) or a = (6.5%^(m+1)) * (n*m*k)
with:
n = number of letters in the original word
m = number of letters in the second word
k = number of possible different spacings
p = probability of the crossing occurring
a = average number of occurrences (in practice close to p and easier to compute)
This formula is valid for just one string crossing another at a given location where one of the strings already appears, as in our example. Notice that there can be variants of this formula if you define different kinds of crossings (here the two words actually intersect; alternatively, ‘tomato’ could point towards ‘spaghetti’ without actually crossing it. This gives you a lot more possible locations and removes constraint number 3).
If you want to compute the probability of a 3-word crossing (e.g. ‘tomato’ crosses ‘spaghetti’, then ‘sauce’ crosses ‘tomato’ OR ‘spaghetti’), then assume that ‘tomato’ and ‘spaghetti’ already cross (with the probability as above), and look for the probability that ‘sauce’ crosses the composite word ‘tomatospaghetti’. This second occurrence appears on average 6.5%^6 * (15*5*20) times, i.e. it has about 1 chance in 9000 of occurring. The whole 3-word crossing therefore has a probability of about 1/(188000*9000). Sure, this is not a lot, but remember that I used long words (5 and 6 letters).
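The general formula and the 3-word case can be put together in a short sketch. The function name `average_crossings` is hypothetical, and multiplying the two stages treats them as independent, exactly as in the reasoning above.

```python
def average_crossings(n, m, k, q=0.065):
    """Average number of times an m-letter word crosses an n-letter word.

    n: letters in the word already present in the text
    m: letters in the word we are looking for
    k: number of allowed spacings
    q: chance that two random letters match (6.5% assumed, as above)
    """
    return q ** (m + 1) * n * m * k


# Stage 1: 'tomato' (6 letters) crosses 'spaghetti' (9 letters).
stage1 = average_crossings(n=9, m=6, k=20)

# Stage 2: 'sauce' (5 letters) crosses the 15 letters of
# 'tomato' + 'spaghetti'.
stage2 = average_crossings(n=15, m=5, k=20)

print(f"stage 1: about 1 in {1 / stage1:,.0f}")
print(f"stage 2: about 1 in {1 / stage2:,.0f}")
print(f"both:    about 1 in {1 / (stage1 * stage2):,.0f}")
```

Shorter words raise these figures quickly, since every letter removed multiplies the result by roughly 15.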