Gattaca statistics

The column is at: http://www.straightdope.com/mailbag/mgattaca.html

I’m a lawyer, not a statistician, so someone may need to correct me, but the probability calculation in the column struck me as a bit off. The column noted that the chance of a random 7 letter sequence being GATTACA, where the letters G, A, T and C are the only letters available, would be 1 in 16,384 (1/4 to the seventh power).

The column then notes that there are about 3 billion nucleic acids in the human genome. Where I think the column may be wrong is in its assertion that this means that there are therefore about 400 million seven letter sequences available (presumably obtained by dividing 3 billion by seven).

I think this underestimates the number of available sequences. The string GATTACA could appear in slots 1-7, or in slots 2-8, 3-9, and so on. This would mean that there are actually about 3 billion minus six sequences available.

However, this would suggest that the string GATTACA should appear in the human genome approximately 183,000 times, which appears to be in conflict with the search results in the column, indicating that the sequence appeared 92,000 times. The search apparently went beyond the human genome, so the number 92,000 overstates the number in the human genome by at least four–the three fruit fly and one e. coli–and likely overstates the actual number by even more. This could be consistent with the 26,000 occurences predicted in the column, while the only way to reconcile the search with my estimate would be if only half or less of the human genome was included in the databases searched.

Of course, trying to statistically predict the number of occurences (however calculated) requires us to assume that each of the four letters is equally likely to occur in any given slot and that the identity of any given slot does not affect the likelihood that another letter would appear close to it. I’d still be interested to hear from a math person if my layman’s instincts on this are correct, though.

It seems to me that the title, Gattaca, is meant more to be a pun. It sounds just like Attica, the famous prison.

They changed the i to an a, and tacked a G on to the front to give it the four amino acids in the name. It’s meant to reflect how a person’s genes had become their prison.

This is a great column, and one that raises my hopes for an aswer the eternal question that has plagued me the entire time I’ve been on the SDMB.

Kudos (and beer!) to Dex and Doug!

I just want to disagree with the column, when it says “there really is no other catchy way to put those particular letters together.” CATGAG. Think about it.

Apparently they didn’t plan from the beginning to compose the title of G, C, A, and T. The working title of the film was The Eighth Day according to this site

You are correct that GATTACA could appear in 1-7, 2-8, or 3-9, but if it appears in 1-7 that excludes it from appearing in 2-8, 3-9,…, 7-13. The next place it could appear is in 8-14. That is why you have to use the number of seven letter sequences.

The reason the numbers are different because the search database only searched known genes, not the whole human (or any other species) genome. The majority of DNA doesn’t code for any gene.

That doesn’t make sense to me.

RM" << That doesn’t make sense to me >>

OK, say that GATTACA appears in positions 1-7. Note that positions 2, 3, 4, 5, 6, 7 do not start with a G. Position 2 is, in fact, already an A.
Montfort: THanks for the kudos (and beer), but I should note that it is my son, Son of Dex, who did this Mailbag Answer, and not I. He gets the kudos, I get the nakhas.

I have asked him to respond to the probability calcs, and he will do so when he has time. He had written a much longer Answer that explained the calcs in more detail, and I said to short-cut because the Answer was long enough already. And see what happens. Sigh.

[Edited by CKDextHavn on 09-18-2000 at 12:50 PM]

The sequence “GATTACA” can appear a maximum of (3 billion / 7) times in the sequence, but that’s not really relevant, since we don’t expect it to be anywhere near the maximum. Any given “GATTACA” string can appear in any one of (3 billion - 6) positions.

Put another way: The case that string 1-7 is GATTACA precludes the possibility that string 2-8 is GATTACA, but the (much more common) case that string 1-7 is not GATTACA does not preclude the possibility of the 2-8 string being so.

That still doesn’t mean that 3 billion should be divided by 7. Chronos seems to have covered that.

Chronos I’m not sure I understand what you are saying. Do you think the statistics were done correctly or not? I think they were correct. You take the probability the sequence will appear in any 7 letter sequence ( (1/4)^7 ) then multiply it by the maximum possible number of independent times it could occur in the human genome ( 3,000,000,000/7 ). That gives the average number of times it should appear in a human genome (assuming nucleotides appear randomly).

Even though I don’t think this relavent I have to nit pick. SHouldn’t it be 3 billion - 12? The seven letter sequence can’t end in positions 1, 2, 3, 4, 5, or 6, but it also can’t start in positions 2999999995, 2999999996, 2999999997, 2999999998, 2999999999, or 3000000000. That excludes 12 sequences which over run the genome and are not allowed.

I just don’t buy the 3 billion - 12 answer. True all those states are possible, but they are not independent states. Independence is key. Once one GATTACA sequence appears it eliminates 12 others from possiblity. When you are talking about the maximum number of states in a probability calculation, the states must be indenpendent.

I think the 3 billion/7 is too high for the number of independent states. Although it is possible for there to be 3 billion/7 number of GATTACA’s it would occur under very unlikely circumstances. All the GATTACA’s would need to be in the same phase. However, since GATTACA is very unlikely to occur the exclusion of states should be minimal. That is to say everytime a GATTACA appears out of phase, it creates a new phase where nearly 3 billion/7 possiblities exist. Only when different phases interfere are there permanent exclusions of states. If you assume there are 3 billion - 12 states then everytime GATTACA appears it eliminates between 6 and 12 states (depending on whether it appears at the beginning/end or the middle).

So now I’m saying that 3 billion/7 isn’t right either. However, I think it is very close to the correct number of independent states.

BTW, am I making sense to anyone?

Dr. Lao, Chronos has the right idea, though I think he’s off by a tiny bit. Since there are 46 chromosomes in human DNA, I think the potential number of sites is 3,000,000,000 - (46 * 6), instead of 3,000,000,000 - 6.

Break it down into a smaller problem, and you can see why it works this way. Suppose, that instead of 3 billion acids, there were only 9. So there are 4^9 possible states. What do you think the probability is that GATTACA will appear in the nine acids? It’s easy to see that there are 16 * 3 = 48 possible ways it can show up. With 12 base acids, it will show up (12 - 6) * (4^12)/(4^7), for a total of 6144 times out of 16,777,216. With 15 acids, it will show up (15 - 6) * (4^15)/(4^7), for a total of 589,824 times out of 1,073,741,824.

While these numbers are too large to count by hand, they’re small enough to be easily counted by a computer. I wrote a short program to iterate through all possible combinations for 15 base acids, and count up the number of times GATTACA shows up. You can run it yourself, if you have a C compiler:



#include <stdio.h>
#define ACIDS 15
int checkSize = 1 << (ACIDS << 1);
#define A 0
#define C 1
#define G 2
#define T 3
int gattaca = (G << 12) + (A << 10) + (T << 8) + (T << 6)
+ (A << 4) + (C << 2) + (A << 0);

int
findGat(int num)
{
  int result = 0;
  while (num >= gattaca) {
    result += gattaca == num - ((num >> 14) << 14);
    num >>= 2;
  }
  return result;
}

main()
{
  int i, count;
  for(i = 0, count = 0; i < checkSize ; i++) {
    count += findGat(i);
  }
  printf("Found %d occurrences out of %d base acids.
", count, ACIDS);
}

If 3 billion/7 was right, then with 15 base acids, there would only be 2 occurrences of GATTACA. Clearly, 589,824 is not 2.

Now I see both sides of the issue and I’m very confused. I hope Son of Dex comes soon to explain the situation. Or failing that an expert in statistical mechanics.

And Punoqllads, I think scaling it down won’t work. It is obvious that in the majority of sequences of a small number won’t contain GATTACA. Therefore the states are nearly independent. However, in a 3 billion base pair sequence GATTACA is expected to appear many times. Many, many, states have a probability of zero for containing GATTACA. As you increase the number of bases, or increase the probability of GATTACA appearing, the estimate that all the bases have an equal probability goes down. By how much requires someone who knows more about this than me.

You can think that scaling it down won’t work, but you will be wrong. You can use mathematical induction to show that for N base pairs, the same formula will hold true for N+1 base pairs.

I have absolutely no idea what point you are trying to make. To find the expected value, you iterate through all the possible strings, sum the number of times you see GATTACA, and divide by the total number of strings you checked. Adding an additional base pair makes the likelihood of GATTACA appearing go up, not down.

Ok, I’ll try and make myself clearer. Let’s say we fixed it that GATTACA had 50% chance of appearing in any seven letter sequence. Now if we look at the human genome, by your method we would expect 1.5 billion appearences of GATTACA ( (1/2)*(3 billion - 6) ). This is clearly ridiculous. The maximum number of time ANY seven letter sequence could appear is 3 billion/7. Do you see what I am saying about not all 3 billion states having an equal probability? Maybe the source of the confusion is something I mistyped.

This should say that the estimate of all the states having equal probability (and hence independent) goes down. You are making an estimate that all 3 billion base have an equal probability, (1/4)^7, of having GATTACA. This is clearly not true based on my example of changing the probability. Some states have a probability of (1/4)^7 and some have a probability of zero (because they are near other GATTACA’s). Is (1/4)^7 close enough to zero for this not to matter? I don’t know.

I’m not sure about the scaling factor; whether it is a valid change or not. I’ll have to think about that.

To clarify my clarification, I should say the estimate is less valid. This is what I mean by “goes down.”

That’s because the setup itself is invalid. The fact that GATTACA can’t overlap on a shifted copy of itself contradicts the initial assumption that the probability of it occurring in any seven-letter sequence is 50%. It’s the statistical equivalent of dividing by zero — you get a nonsense answer.

I see what you’re saying. You’re wrong. There are not 3 billion states; there are 4^3,000,000,000 states. That’s about a 1 with 1.8 billion zeros after it. The minimum number of times GATTACA will appear is 0. The maximum number of times GATTACA will appear is 428,571,428. The average number of times GATTACA will appear is about 183,000. Look, I wrote a Monte Carlo algorithm to check it out, and even it agrees with me:



#include <stdio.h>
#include <stdlib.h>

#define ACIDS 3000000000
#define A 0
#define C 1
#define G 2
#define T 3

unsigned int gattaca = (G << 12) + (A << 10) + (T << 8) +
 (T << 6) + (A << 4) + (C << 2) + (A << 0);
unsigned int gatmask = (1 << 14) - 1;
#define nextGat(num) (((num) << 2)&gatmask) | ((random() >> 9)&0x3)

main()
{
  unsigned int i, count, current;
  for(i = 0, count = 0, current = 0; i <= ACIDS;
      i++, current = nextGat(current)) {
    count += current == gattaca;
  }
  printf("Found %u occurrences out of %u base acids.
", count, ACIDS);
}


I ran it three separate times, with 30M, 300M, and 3B base pairs. The results were:



Found 1899 occurrences out of 30000000 base acids.
Found 18206 occurrences out of 300000000 base acids.
Found 182780 occurrences out of 3000000000 base acids.


In all cases, reasonably close to what Chronos and kjsheehan said. Feel free to rerun it with different random seeds if you think the results will come up any differently.

Now we’ve gotten to the heart of the matter. In the human genome the odds that in any set of seven bases GATTACA will appear is not (1/4)^7 because this implies independence and doesn’t take into consideration that GATTACA can’t overlap with itself. The probability is just very close to (1/4)^7. And thanks for this:

This is, finally, the correct way to look at it. Each of the 4^3,000,000,000 is an independent state. Averaging the number of time the sequence appears gives the right answer. Makes sense now.

I agree that this part of the mailbag answer was incorrect an the correct number of times the sequence should appear is approximately 183,000. Thanks for your explaination and patience, Punoqllads.

Sigh. So I went to a expert in combinatorial probabilities, and he gave me:

If n = length of sequence (in this case 3 billion)
k = length of subsequence to match (in this case 7)
t = number of equally likely possible values for a position in the sequence (in this case 4)

Then the expected number of matching subsequences is:

[n - (k-1)]/[t^k} = 183,105.468

If we actually divide the number of spaces into the 48 distinct chromosomes, and assume an equal number of spaces on each chromosome (the assumption won’t matter, as you’ll see), then we get 183,105.451 … the 3 billion is so overwhelming that the minor matter of losing a six more possibilities at the end of 48 strings just doesn’t come into play.

So it appears that Son of Dex miscalculated slightly; what happens when he does this stuff under stress. Thanks, kjsheehan, for finding this (even if you are a lawyer.) We now need to figure why the computer search turned up about half of the expected number, but I think this has to do with something occurring in pairs (this was some bio-chem gobbledy-gook that I didn’t understand.) ’

More to come.