The column is at: http://www.straightdope.com/mailbag/mgattaca.html
I’m a lawyer, not a statistician, so someone may need to correct me, but the probability calculation in the column struck me as a bit off. The column noted that the chance of a random seven-letter sequence being GATTACA, where G, A, T and C are the only letters available, would be 1 in 16,384 (1/4 to the seventh power).
The column then notes that there are about 3 billion nucleic acid bases in the human genome. Where I think the column may be wrong is in its assertion that there are therefore about 400 million seven-letter sequences available (presumably obtained by dividing 3 billion by seven).
I think this underestimates the number of available sequences. The string GATTACA could appear in slots 1-7, or in slots 2-8, 3-9, and so on. This would mean that there are actually about 3 billion minus six sequences available.
However, this would suggest that the string GATTACA should appear in the human genome approximately 183,000 times (3 billion divided by 16,384), which appears to conflict with the search results in the column, which indicated that the sequence appeared 92,000 times. The search apparently went beyond the human genome, so the number 92,000 overstates the number in the human genome by at least four (the three fruit fly hits and the one E. coli hit) and likely by even more. This could be consistent with the 26,000 occurrences predicted in the column, while the only way to reconcile the search with my estimate would be if only half or less of the human genome was included in the databases searched.
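For anyone who wants to check the arithmetic, here is a rough Python sketch of the two estimates. It assumes the column's round figure of 3 billion letters and treats every letter as equally likely and independent; the variable names are my own.

```python
# Rough check of the two estimates, assuming each of the four letters
# is equally likely in every slot and slots are independent.
GENOME_LENGTH = 3_000_000_000   # the column's round figure (assumption)
PATTERN_LENGTH = 7              # length of GATTACA

p_match = (1 / 4) ** PATTERN_LENGTH   # chance a given 7-letter window is GATTACA

# Column's approach: chop the genome into non-overlapping blocks of seven.
blocks = GENOME_LENGTH // PATTERN_LENGTH
expected_blocks = blocks * p_match

# Sliding-window approach: every starting slot counts (1-7, 2-8, 3-9, ...).
windows = GENOME_LENGTH - PATTERN_LENGTH + 1
expected_windows = windows * p_match

print(round(1 / p_match))        # 16384
print(round(expected_blocks))    # roughly 26,000
print(round(expected_windows))   # roughly 183,000
```

The sliding-window count is "3 billion minus six" because the last possible starting slot is six positions before the end.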
Of course, trying to statistically predict the number of occurrences (however calculated) requires us to assume that each of the four letters is equally likely to occur in any given slot and that the letter in any given slot does not affect the likelihood of any particular letter appearing near it. I’d still be interested to hear from a math person if my layman’s instincts on this are correct, though.
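One way to sanity-check the sliding-window reasoning under exactly those assumptions is a small simulation. This is only a sketch: the sequence length of one million is a made-up size chosen so it runs quickly, the function names are hypothetical, and a random string says nothing about how real DNA behaves.

```python
import random

def count_overlapping(sequence, pattern):
    """Count occurrences of pattern at every starting slot of sequence."""
    width = len(pattern)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == pattern
    )

def simulate(length, pattern="GATTACA", seed=0):
    """Build a random sequence with all four letters equally likely,
    each slot independent of the others, and count the pattern."""
    rng = random.Random(seed)
    sequence = "".join(rng.choice("GATC") for _ in range(length))
    return count_overlapping(sequence, pattern)

length = 1_000_000                      # illustrative size, not the real genome
expected = (length - 6) / 4 ** 7        # sliding-window prediction
observed = simulate(length)
print(f"expected about {expected:.0f}, observed {observed}")
```

If the sliding-window estimate is right, the observed count should land near the prediction of about 61 rather than near one seventh of it.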