Gattaca statistics

kjsheehan · September 15, 2000, 4:39pm

The column is at: http://www.straightdope.com/mailbag/mgattaca.html

I’m a lawyer, not a statistician, so someone may need to correct me, but the probability calculation in the column struck me as a bit off. The column noted that the chance of a random 7 letter sequence being GATTACA, where the letters G, A, T and C are the only letters available, would be 1 in 16,384 (1/4 to the seventh power).

The column then notes that there are about 3 billion nucleic acids in the human genome. Where I think the column may be wrong is in its assertion that this means that there are therefore about 400 million seven letter sequences available (presumably obtained by dividing 3 billion by seven).

I think this underestimates the number of available sequences. The string GATTACA could appear in slots 1-7, or in slots 2-8, 3-9, and so on. This would mean that there are actually about 3 billion minus six sequences available.

However, this would suggest that the string GATTACA should appear in the human genome approximately 183,000 times, which appears to be in conflict with the search results in the column, indicating that the sequence appeared 92,000 times. The search apparently went beyond the human genome, so the number 92,000 overstates the number in the human genome by at least four–the three fruit fly and one e. coli–and likely overstates the actual number by even more. This could be consistent with the 26,000 occurences predicted in the column, while the only way to reconcile the search with my estimate would be if only half or less of the human genome was included in the databases searched.

Of course, trying to statistically predict the number of occurences (however calculated) requires us to assume that each of the four letters is equally likely to occur in any given slot and that the identity of any given slot does not affect the likelihood that another letter would appear close to it. I’d still be interested to hear from a math person if my layman’s instincts on this are correct, though.

Punoqllads · September 15, 2000, 7:41pm

It seems to me that the title, Gattaca, is meant more to be a pun. It sounds just like Attica, the famous prison.

They changed the i to an a, and tacked a G on to the front to give it the four amino acids in the name. It’s meant to reflect how a person’s genes had become their prison.

Montfort · September 16, 2000, 9:37pm

This is a great column, and one that raises my hopes for an aswer the eternal question that has plagued me the entire time I’ve been on the SDMB.

Kudos (and beer!) to Dex and Doug!

RM_Mentock · September 17, 2000, 4:29am

I just want to disagree with the column, when it says “there really is no other catchy way to put those particular letters together.” CATGAG. Think about it.

bibliophage · September 17, 2000, 7:14am

Apparently they didn’t plan from the beginning to compose the title of G, C, A, and T. The working title of the film was The Eighth Day according to this site

Dr.Lao · September 17, 2000, 5:39pm

You are correct that GATTACA could appear in 1-7, 2-8, or 3-9, but if it appears in 1-7 that excludes it from appearing in 2-8, 3-9,…, 7-13. The next place it could appear is in 8-14. That is why you have to use the number of seven letter sequences.

The reason the numbers are different because the search database only searched known genes, not the whole human (or any other species) genome. The majority of DNA doesn’t code for any gene.

RM_Mentock · September 18, 2000, 2:31am

That doesn’t make sense to me.

C_K_Dexter_Haven · September 18, 2000, 2:35pm

RM" << That doesn’t make sense to me >>

OK, say that GATTACA appears in positions 1-7. Note that positions 2, 3, 4, 5, 6, 7 do not start with a G. Position 2 is, in fact, already an A.
Montfort: THanks for the kudos (and beer), but I should note that it is my son, Son of Dex, who did this Mailbag Answer, and not I. He gets the kudos, I get the nakhas.

I have asked him to respond to the probability calcs, and he will do so when he has time. He had written a much longer Answer that explained the calcs in more detail, and I said to short-cut because the Answer was long enough already. And see what happens. Sigh.

[Edited by CKDextHavn on 09-18-2000 at 12:50 PM]

Chronos · September 18, 2000, 4:14pm

The sequence “GATTACA” can appear a maximum of (3 billion / 7) times in the sequence, but that’s not really relevant, since we don’t expect it to be anywhere near the maximum. Any given “GATTACA” string can appear in any one of (3 billion - 6) positions.

Put another way: The case that string 1-7 is GATTACA precludes the possibility that string 2-8 is GATTACA, but the (much more common) case that string 1-7 is not GATTACA does not preclude the possibility of the 2-8 string being so.

RM_Mentock · September 18, 2000, 4:26pm

That still doesn’t mean that 3 billion should be divided by 7. Chronos seems to have covered that.

Dr.Lao · September 19, 2000, 5:26am

Chronos I’m not sure I understand what you are saying. Do you think the statistics were done correctly or not? I think they were correct. You take the probability the sequence will appear in any 7 letter sequence ( (1/4)^7 ) then multiply it by the maximum possible number of independent times it could occur in the human genome ( 3,000,000,000/7 ). That gives the average number of times it should appear in a human genome (assuming nucleotides appear randomly).

Even though I don’t think this relavent I have to nit pick. SHouldn’t it be 3 billion - 12? The seven letter sequence can’t end in positions 1, 2, 3, 4, 5, or 6, but it also can’t start in positions 2999999995, 2999999996, 2999999997, 2999999998, 2999999999, or 3000000000. That excludes 12 sequences which over run the genome and are not allowed.

I just don’t buy the 3 billion - 12 answer. True all those states are possible, but they are not independent states. Independence is key. Once one GATTACA sequence appears it eliminates 12 others from possiblity. When you are talking about the maximum number of states in a probability calculation, the states must be indenpendent.

Dr.Lao · September 19, 2000, 5:50am

I think the 3 billion/7 is too high for the number of independent states. Although it is possible for there to be 3 billion/7 number of GATTACA’s it would occur under very unlikely circumstances. All the GATTACA’s would need to be in the same phase. However, since GATTACA is very unlikely to occur the exclusion of states should be minimal. That is to say everytime a GATTACA appears out of phase, it creates a new phase where nearly 3 billion/7 possiblities exist. Only when different phases interfere are there permanent exclusions of states. If you assume there are 3 billion - 12 states then everytime GATTACA appears it eliminates between 6 and 12 states (depending on whether it appears at the beginning/end or the middle).

So now I’m saying that 3 billion/7 isn’t right either. However, I think it is very close to the correct number of independent states.

BTW, am I making sense to anyone?

Punoqllads · September 19, 2000, 7:50am

Dr. Lao, Chronos has the right idea, though I think he’s off by a tiny bit. Since there are 46 chromosomes in human DNA, I think the potential number of sites is 3,000,000,000 - (46 * 6), instead of 3,000,000,000 - 6.

Break it down into a smaller problem, and you can see why it works this way. Suppose, that instead of 3 billion acids, there were only 9. So there are 4^9 possible states. What do you think the probability is that GATTACA will appear in the nine acids? It’s easy to see that there are 16 * 3 = 48 possible ways it can show up. With 12 base acids, it will show up (12 - 6) * (4^12)/(4^7), for a total of 6144 times out of 16,777,216. With 15 acids, it will show up (15 - 6) * (4^15)/(4^7), for a total of 589,824 times out of 1,073,741,824.

While these numbers are too large to count by hand, they’re small enough to be easily counted by a computer. I wrote a short program to iterate through all possible combinations for 15 base acids, and count up the number of times GATTACA shows up. You can run it yourself, if you have a C compiler:



#include <stdio.h>
#define ACIDS 15
int checkSize = 1 << (ACIDS << 1);
#define A 0
#define C 1
#define G 2
#define T 3
int gattaca = (G << 12) + (A << 10) + (T << 8) + (T << 6)
+ (A << 4) + (C << 2) + (A << 0);

int
findGat(int num)
{
  int result = 0;
  while (num >= gattaca) {
    result += gattaca == num - ((num >> 14) << 14);
    num >>= 2;
  }
  return result;
}

main()
{
  int i, count;
  for(i = 0, count = 0; i < checkSize ; i++) {
    count += findGat(i);
  }
  printf("Found %d occurrences out of %d base acids.
", count, ACIDS);
}

If 3 billion/7 was right, then with 15 base acids, there would only be 2 occurrences of GATTACA. Clearly, 589,824 is not 2.

Dr.Lao · September 19, 2000, 6:02pm

Now I see both sides of the issue and I’m very confused. I hope Son of Dex comes soon to explain the situation. Or failing that an expert in statistical mechanics.

And Punoqllads, I think scaling it down won’t work. It is obvious that in the majority of sequences of a small number won’t contain GATTACA. Therefore the states are nearly independent. However, in a 3 billion base pair sequence GATTACA is expected to appear many times. Many, many, states have a probability of zero for containing GATTACA. As you increase the number of bases, or increase the probability of GATTACA appearing, the estimate that all the bases have an equal probability goes down. By how much requires someone who knows more about this than me.

Punoqllads · September 19, 2000, 6:26pm

You can think that scaling it down won’t work, but you will be wrong. You can use mathematical induction to show that for N base pairs, the same formula will hold true for N+1 base pairs.

I have absolutely no idea what point you are trying to make. To find the expected value, you iterate through all the possible strings, sum the number of times you see GATTACA, and divide by the total number of strings you checked. Adding an additional base pair makes the likelihood of GATTACA appearing go up, not down.

Dr.Lao · September 19, 2000, 7:02pm

Ok, I’ll try and make myself clearer. Let’s say we fixed it that GATTACA had 50% chance of appearing in any seven letter sequence. Now if we look at the human genome, by your method we would expect 1.5 billion appearences of GATTACA ( (1/2)*(3 billion - 6) ). This is clearly ridiculous. The maximum number of time ANY seven letter sequence could appear is 3 billion/7. Do you see what I am saying about not all 3 billion states having an equal probability? Maybe the source of the confusion is something I mistyped.

This should say that the estimate of all the states having equal probability (and hence independent) goes down. You are making an estimate that all 3 billion base have an equal probability, (1/4)^7, of having GATTACA. This is clearly not true based on my example of changing the probability. Some states have a probability of (1/4)^7 and some have a probability of zero (because they are near other GATTACA’s). Is (1/4)^7 close enough to zero for this not to matter? I don’t know.

I’m not sure about the scaling factor; whether it is a valid change or not. I’ll have to think about that.

Dr.Lao · September 19, 2000, 7:06pm

To clarify my clarification, I should say the estimate is less valid. This is what I mean by “goes down.”

Punoqllads · September 19, 2000, 8:12pm

That’s because the setup itself is invalid. The fact that GATTACA can’t overlap on a shifted copy of itself contradicts the initial assumption that the probability of it occurring in any seven-letter sequence is 50%. It’s the statistical equivalent of dividing by zero — you get a nonsense answer.

I see what you’re saying. You’re wrong. There are not 3 billion states; there are 4^3,000,000,000 states. That’s about a 1 with 1.8 billion zeros after it. The minimum number of times GATTACA will appear is 0. The maximum number of times GATTACA will appear is 428,571,428. The average number of times GATTACA will appear is about 183,000. Look, I wrote a Monte Carlo algorithm to check it out, and even it agrees with me:



#include <stdio.h>
#include <stdlib.h>

#define ACIDS 3000000000
#define A 0
#define C 1
#define G 2
#define T 3

unsigned int gattaca = (G << 12) + (A << 10) + (T << 8) +
 (T << 6) + (A << 4) + (C << 2) + (A << 0);
unsigned int gatmask = (1 << 14) - 1;
#define nextGat(num) (((num) << 2)&gatmask) | ((random() >> 9)&0x3)

main()
{
  unsigned int i, count, current;
  for(i = 0, count = 0, current = 0; i <= ACIDS;
      i++, current = nextGat(current)) {
    count += current == gattaca;
  }
  printf("Found %u occurrences out of %u base acids.
", count, ACIDS);
}

I ran it three separate times, with 30M, 300M, and 3B base pairs. The results were:



Found 1899 occurrences out of 30000000 base acids.
Found 18206 occurrences out of 300000000 base acids.
Found 182780 occurrences out of 3000000000 base acids.

In all cases, reasonably close to what Chronos and kjsheehan said. Feel free to rerun it with different random seeds if you think the results will come up any differently.

Dr.Lao · September 19, 2000, 8:51pm

Now we’ve gotten to the heart of the matter. In the human genome the odds that in any set of seven bases GATTACA will appear is not (1/4)^7 because this implies independence and doesn’t take into consideration that GATTACA can’t overlap with itself. The probability is just very close to (1/4)^7. And thanks for this:

This is, finally, the correct way to look at it. Each of the 4^3,000,000,000 is an independent state. Averaging the number of time the sequence appears gives the right answer. Makes sense now.

I agree that this part of the mailbag answer was incorrect an the correct number of times the sequence should appear is approximately 183,000. Thanks for your explaination and patience, Punoqllads.

C_K_Dexter_Haven · September 19, 2000, 8:59pm

Sigh. So I went to a expert in combinatorial probabilities, and he gave me:

If n = length of sequence (in this case 3 billion)
k = length of subsequence to match (in this case 7)
t = number of equally likely possible values for a position in the sequence (in this case 4)

Then the expected number of matching subsequences is:

[n - (k-1)]/[t^k} = 183,105.468

If we actually divide the number of spaces into the 48 distinct chromosomes, and assume an equal number of spaces on each chromosome (the assumption won’t matter, as you’ll see), then we get 183,105.451 … the 3 billion is so overwhelming that the minor matter of losing a six more possibilities at the end of 48 strings just doesn’t come into play.

So it appears that Son of Dex miscalculated slightly; what happens when he does this stuff under stress. Thanks, kjsheehan, for finding this (even if you are a lawyer.) We now need to figure why the computer search turned up about half of the expected number, but I think this has to do with something occurring in pairs (this was some bio-chem gobbledy-gook that I didn’t understand.) ’

More to come.

Topic		Replies	Views
Odds concerning monkeys and the lottery Factual Questions	34	1623	January 26, 2004
Can someone teach me genetics 101? Factual Questions	28	2587	November 18, 2002
Read my DNA Factual Questions	9	1375	October 26, 2005
Really Big Math Question Factual Questions	17	1450	January 16, 2007
Refute The Carter Hypothesis (Statistics & Probability) Great Debates	27	2000	December 5, 2003

Gattaca statistics

Related topics