#1




The column is at: http://www.straightdope.com/mailbag/mgattaca.html
I'm a lawyer, not a statistician, so someone may need to correct me, but the probability calculation in the column struck me as a bit off. The column noted that the chance of a random 7-letter sequence being GATTACA, where G, A, T and C are the only letters available, would be 1 in 16,384 (1/4 to the seventh power). The column then notes that there are about 3 billion nucleic acids in the human genome. Where I think the column may be wrong is in its assertion that there are therefore about 400 million seven-letter sequences available (presumably obtained by dividing 3 billion by seven). I think this underestimates the number of available sequences. The string GATTACA could appear in slots 1-7, or in slots 2-8, 3-9, and so on. This would mean that there are actually about 3 billion minus six sequences available.
However, this would suggest that the string GATTACA should appear in the human genome approximately 183,000 times, which appears to conflict with the search results in the column, indicating that the sequence appeared 92,000 times. The search apparently went beyond the human genome, so the number 92,000 overstates the number in the human genome by at least four (the three fruit fly and one E. coli) and likely overstates the actual number by even more. This could be consistent with the 26,000 occurrences predicted in the column, while the only way to reconcile the search with my estimate would be if only half or less of the human genome was included in the databases searched.
Of course, trying to statistically predict the number of occurrences (however calculated) requires us to assume that each of the four letters is equally likely to occur in any given slot and that the identity of any given slot does not affect the likelihood that another letter would appear close to it. I'd still be interested to hear from a math person if my layman's instincts on this are correct, though.


#2




It seems to me that the title, Gattaca, is meant more to be a pun. It sounds just like Attica, the famous prison.
They changed the i to an a, and tacked a G on to the front to give it the four nucleotide letters in the name. It's meant to reflect how a person's genes had become their prison.
__________________
Loading, please wait... 
#3




This is a great column, and one that raises my hopes for an answer to the eternal question that has plagued me the entire time I've been on the SDMB.
Kudos (and beer!) to Dex and Doug!
__________________
First Anniz said "Yes,", then she said "I do," and then she [url="http://boards.straightdope.com/sdmb/showthread.php?threadid=209517"]bore my son. How can you not love her? I asked and Cecil answered me. 
#4




I just want to disagree with the column, when it says "there really is no other catchy way to put those particular letters together." CATGAG. Think about it.
__________________
rocks 
#6




Quote:
Quote:

#7




Quote:

#8




RM: << That doesn't make sense to me >>
OK, say that GATTACA appears in positions 1-7. Note that positions 2, 3, 4, 5, 6, 7 do not start with a G. Position 2 is, in fact, already an A.
Montfort: Thanks for the kudos (and beer), but I should note that it is my son, Son of Dex, who did this Mailbag Answer, and not I. He gets the kudos, I get the nakhas. I have asked him to respond to the probability calcs, and he will do so when he has time. He had written a much longer Answer that explained the calcs in more detail, and I said to shortcut because the Answer was long enough already. And see what happens. Sigh.
[Edited by CKDextHavn on 09-18-2000 at 12:50 PM]
#9




The sequence "GATTACA" can appear a maximum of (3 billion / 7) times in the sequence, but that's not really relevant, since we don't expect it to be anywhere near the maximum. Any given "GATTACA" string can appear in any one of (3 billion - 6) positions.
Put another way: The case that string 1-7 is GATTACA precludes the possibility that string 2-8 is GATTACA, but the (much more common) case that string 1-7 is not GATTACA does not preclude the possibility of the 2-8 string being so.
__________________
Time travels in divers paces with divers persons. As You Like It, III:ii:328 
#10




Quote:

#11




Chronos I'm not sure I understand what you are saying. Do you think the statistics were done correctly or not? I think they were correct. You take the probability the sequence will appear in any 7 letter sequence ( (1/4)^7 ) then multiply it by the maximum possible number of independent times it could occur in the human genome ( 3,000,000,000/7 ). That gives the average number of times it should appear in a human genome (assuming nucleotides appear randomly).
Quote:
I just don't buy the 3 billion - 12 answer. True, all those states are possible, but they are not independent states. Independence is key. Once one GATTACA sequence appears it eliminates 12 others from possibility. When you are talking about the maximum number of states in a probability calculation, the states must be independent.
#12




The more I think about it...
I think the 3 billion/7 is too high for the number of independent states. Although it is possible for there to be 3 billion/7 GATTACAs, that would occur only under very unlikely circumstances. All the GATTACAs would need to be in the same phase. However, since GATTACA is very unlikely to occur, the exclusion of states should be minimal. That is to say, every time a GATTACA appears out of phase, it creates a new phase where nearly 3 billion/7 possibilities exist. Only when different phases interfere are there permanent exclusions of states. If you assume there are 3 billion - 12 states, then every time GATTACA appears it eliminates between 6 and 12 states (depending on whether it appears at the beginning/end or the middle).
So now I'm saying that 3 billion/7 isn't right either. However, I think it is very close to the correct number of independent states. BTW, am I making sense to anyone? 
#13




Re: The more I think about it...
Dr. Lao, Chronos has the right idea, though I think he's off by a tiny bit. Since there are 46 chromosomes in human DNA, I think the potential number of sites is 3,000,000,000 - (46 * 6), instead of 3,000,000,000 - 6.
Break it down into a smaller problem, and you can see why it works this way. Suppose that instead of 3 billion acids, there were only 9. So there are 4^9 possible states. What do you think the probability is that GATTACA will appear in the nine acids? It's easy to see that there are 16 * 3 = 48 possible ways it can show up. With 12 base acids, it will show up (12 - 6) * (4^12)/(4^7), for a total of 6144 times out of 16,777,216. With 15 acids, it will show up (15 - 6) * (4^15)/(4^7), for a total of 589,824 times out of 1,073,741,824. While these numbers are too large to count by hand, they're small enough to be easily counted by a computer. I wrote a short program to iterate through all possible combinations for 15 base acids, and count up the number of times GATTACA shows up. You can run it yourself, if you have a C compiler:
Code:
#include <stdio.h>

#define ACIDS 15

/* Two bits per base, so ACIDS bases fit in 2 * ACIDS bits. */
int checkSize = 1 << (ACIDS << 1);

#define A 0
#define C 1
#define G 2
#define T 3

int gattaca = (G << 12) + (A << 10) + (T << 8) + (T << 6) + (A << 4) + (C << 2) + (A << 0);

/* Count how many 7-base windows of num equal GATTACA. */
int findGat(int num)
{
    int result = 0;
    while (num >= gattaca) {
        /* Compare the low 14 bits (7 bases) with the pattern, then drop one base. */
        result += gattaca == num - ((num >> 14) << 14);
        num >>= 2;
    }
    return result;
}

int main(void)
{
    int i, count;
    for (i = 0, count = 0; i < checkSize; i++) {
        count += findGat(i);
    }
    printf("Found %d occurrences out of %d base acids.\n", count, ACIDS);
    return 0;
}
#14




Re: The more I think about it...
Now I see both sides of the issue and I'm very confused. I hope Son of Dex comes soon to explain the situation. Or failing that an expert in statistical mechanics.
And Punoqllads, I think scaling it down won't work. It is obvious that the majority of short sequences won't contain GATTACA. Therefore the states are nearly independent. However, in a 3 billion base pair sequence GATTACA is expected to appear many times. Many, many states have a probability of zero for containing GATTACA. As you increase the number of bases, or increase the probability of GATTACA appearing, the estimate that all the bases have an equal probability goes down. By how much requires someone who knows more about this than me.
#15




Re: Re: The more I think about it...
Quote:
Quote:
__________________
Loading, please wait... 
#16




Quote:
Quote:
I'm not sure about the scaling factor; whether it is a valid change or not. I'll have to think about that. 
#17




Quote:

#18




Quote:
Quote:
Code:
#include <stdio.h>
#include <stdlib.h>

#define ACIDS 3000000000u

#define A 0
#define C 1
#define G 2
#define T 3

unsigned int gattaca = (G << 12) + (A << 10) + (T << 8) + (T << 6) + (A << 4) + (C << 2) + (A << 0);
unsigned int gatmask = (1 << 14) - 1;

/* Shift in one random base, keeping only the last 7 bases (14 bits). */
#define nextGat(num) ((((num) << 2) & gatmask) | ((random() >> 9) & 0x3))

int main(void)
{
    unsigned int i, count, current;
    for (i = 0, count = 0, current = 0; i <= ACIDS; i++, current = nextGat(current)) {
        count += current == gattaca;
    }
    printf("Found %u occurrences out of %u base acids.\n", count, ACIDS);
    return 0;
}
Code:
Found 1899 occurrences out of 30000000 base acids.
Found 18206 occurrences out of 300000000 base acids.
Found 182780 occurrences out of 3000000000 base acids.
#19




Quote:
Quote:
I agree that this part of the mailbag answer was incorrect and the correct number of times the sequence should appear is approximately 183,000. Thanks for your explanation and patience, Punoqllads.
#20




Sigh. So I went to an expert in combinatorial probabilities, and he gave me:
If
n = length of sequence (in this case 3 billion)
k = length of subsequence to match (in this case 7)
t = number of equally likely possible values for a position in the sequence (in this case 4)
Then the expected number of matching subsequences is:
[n - (k-1)]/[t^k] = 183,105.468
If we actually divide the number of spaces into the 48 distinct chromosomes, and assume an equal number of spaces on each chromosome (the assumption won't matter, as you'll see), then we get 183,105.451 .... the 3 billion is so overwhelming that the minor matter of losing six more possibilities at the end of each of 48 strings just doesn't come into play.
So it appears that Son of Dex miscalculated slightly; what happens when he does this stuff under stress. Thanks, kjsheehan, for finding this (even if you are a lawyer.) We now need to figure why the computer search turned up about half of the expected number, but I think this has to do with something occurring in pairs (this was some biochem gobbledygook that I didn't understand.)
More to come.

