Assume you have a gene that is 1700 base pairs long. You need ten mutations in the right places, and they need to be the right mutations (i.e., an A turning into a T and not into a C or G). What are the statistical odds of all ten mutations happening at the right base pairs, with the right change at each one? What is the equation for figuring this out?
I figure the odds are 1 in (1700*3)^10. Is that the correct way to do the calculation?
I know little about genetics, so these may or may not be appropriate. Different assumptions will lead to different mathematical models, so I’ll throw these out and you can indicate if they’re correct.
Is there an independent probability per base pair to have any kind of mutation (and is it equally likely for any location)?
Assuming there is, call this p.
Are all mutation ‘types’ equally likely (i.e. A->C, A->T, and T->G have the same probability)?
If so, the probability of a ‘good’ mutation is p/3.
Does the whole thing fall apart if any other mutation occurs?
If not, then you just need 10 ‘good’ mutations: (p/3)[sup]10[/sup].
If so, then you need 10 good mutations and 1690 non-mutations, or (p/3)[sup]10[/sup] * (1-p)[sup]1690[/sup].
This is assuming that by ‘right places’ the ten mutations can only be on particular base pairs. If there’s simply some pattern of mutations that has to be followed, it’s more complex.
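To put rough numbers on the two formulas above, here is a minimal Python sketch. The value p = 1e-4 is only an assumed per-base mutation probability for illustration, and the function names are made up for this example.

```python
def prob_ten_good(p: float, n_good: int = 10) -> float:
    """Other mutations are tolerated: only the n_good specific changes matter."""
    return (p / 3) ** n_good


def prob_ten_good_only(p: float, length: int = 1700, n_good: int = 10) -> float:
    """Every other base must stay unchanged as well."""
    return (p / 3) ** n_good * (1 - p) ** (length - n_good)


p = 1e-4  # assumed per-base mutation probability, for illustration only
print(f"other mutations allowed:   {prob_ten_good(p):.3e}")
print(f"no other mutation allowed: {prob_ten_good_only(p):.3e}")
```

The second value is always the smaller one, since it multiplies the first by (1-p)[sup]1690[/sup], which is less than 1 for any p > 0.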
This is about the hemagglutinin protein in the bird flu virus changing. The protein is about 560 amino acids long, and it needs to undergo about 10 amino acid changes in order for the virus to be transmissible from person to person.
It doesn’t matter if it falls apart when another mutation happens, as other copies of the virus will still exist.
The way I’m reading it (and assuming all mutations are equally likely), you have a 1 in C(1700,10) chance of the mutations occurring in the correct 10 base pairs.
Assuming that happens, you have a 1 in 3[sup]10[/sup] chance of those mutations being the correct mutations.
This gives a 1 in 3,194,634,839,885,918,494,521,126,450,120 chance of happening.
I should also mention I don’t really know any genetics, so my answer may be wrong because of that.
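For anyone who wants to reproduce the count, here is a short Python sketch of the same calculation: choose which 10 of the 1700 positions change, then choose one of 3 alternative letters at each. The constant names are just labels for this example.

```python
from math import comb

LENGTH = 1700      # base pairs in the gene
N_CHANGES = 10     # number of required mutations
ALTERNATIVES = 3   # other letters each base can turn into

position_choices = comb(LENGTH, N_CHANGES)   # C(1700, 10)
letter_choices = ALTERNATIVES ** N_CHANGES   # 3^10
odds = position_choices * letter_choices

print(f"1 in {odds:,}")
```

Python integers are arbitrary precision, so the result is exact rather than a floating-point approximation.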
Just to be clear, I’m interpreting the question this way:
We have a string of 1700 C’s, G’s, A’s, and T’s:
AGCTGATCGATC…CAGT
and we have a “target” string which differs from the original in exactly 10 places.
If we take the original and randomly change 10 of the letters, what is the probability we will hit our target?
That’s the question I answered; I don’t know if it’s the question the OP asked.
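The interpretation above can be sanity-checked by brute force on a toy string, where listing every outcome is feasible. This is only a sketch: the 6-letter strings are made up, and `count_outcomes` is a hypothetical helper name.

```python
from itertools import combinations, product
from math import comb

ALPHABET = "ACGT"


def count_outcomes(original: str, target: str, n_changes: int):
    """Enumerate every way to change exactly n_changes letters of original.

    Returns (hits, total): the number of outcomes equal to target,
    and the number of outcomes overall.
    """
    hits = total = 0
    for positions in combinations(range(len(original)), n_changes):
        # Each chosen position must become one of the 3 other letters.
        options = [[c for c in ALPHABET if c != original[i]] for i in positions]
        for letters in product(*options):
            mutated = list(original)
            for i, c in zip(positions, letters):
                mutated[i] = c
            total += 1
            hits += "".join(mutated) == target
    return hits, total


original = "AGCTGA"  # toy stand-in for the 1700-letter string
target = "ATCTGC"    # differs from original at exactly 2 positions
hits, total = count_outcomes(original, target, 2)
print(f"{hits} hit out of {total} outcomes")
print(total == comb(len(original), 2) * 3 ** 2)
```

Exactly one outcome hits the target, and the total matches C(n, k) * 3[sup]k[/sup], which is the same counting argument scaled down from n = 1700, k = 10.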
If this is the case, then you should start by looking at this problem directly at the amino acid level. Remember that 3 bases make up one codon, which specifies one amino acid, and the codon “code” is redundant (most amino acids can be coded for by several different codons in DNA/RNA). So, while the way you set up the problem is a good start, it doesn’t really model the gene-to-protein system very well. Instead of a single A having to change into a T, a given 3-base codon could change into any of several other codons to get the amino acid needed. (I believe the influenza virus genome is single-stranded RNA, by the way.)
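To illustrate the redundancy point, here is a small Python sketch that lists which single-base changes to a codon produce a given amino acid. It assumes the standard genetic code, written with DNA letters (T rather than the U of RNA) to match the rest of the thread; `single_base_routes` is a made-up helper name.

```python
from itertools import product

# Standard genetic code, codons listed in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_base_routes(codon: str, amino_acid: str) -> list:
    """Codons exactly one base change away from `codon` that encode `amino_acid`."""
    routes = []
    for i, old in enumerate(codon):
        for new in BASES:
            if new != old:
                neighbor = codon[:i] + new + codon[i + 1:]
                if CODON_TABLE[neighbor] == amino_acid:
                    routes.append(neighbor)
    return routes


# Asparagine (AAC) to lysine (K): how many of the 9 single-base changes work?
print(single_base_routes("AAC", "K"))
```

In this example two of the nine possible single-base changes give lysine, so the chance of the "right" mutation at that codon is 2 in 9 rather than 1 in 9.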
If you also take into account the selective pressures of evolution on the organism, and the possibility of horizontal gene transfer between influenza viruses, I think the likelihood of these changes increases greatly.
The odds of a mutation at any one base pair aren’t 50%, as some people may assume, or 25% or 75%, however you picture an exchange occurring. Replication is a copying process that is far more dependable than a random event. You need access to measured data from which the rate of a single base pair mutating can be extrapolated.
I think the flu virus mutates about 1 out of every 10,000 base pairs.
Okay, basing this again on a poor understanding of genetics, here are the ways to model the problem so far.
Let’s assume as Cabbage said, we have a string of letters, 1700 AGCT’s. Call it S0.
The target string (or how it changes) I’m not sure about, but there are a few ideas.
#1 : The target string is identical to S0, except 10 letters (only) correctly changed.
With the second model I gave :
Each letter has a random chance p of changing to another letter (and equally likely to which letter). The probability that only the 10 correct letters have changed is (p/3)[sup]10[/sup] * (1-p)[sup]1690[/sup]
Cabbage’s model :
Ten letters are picked at random, and they are changed (again, equally likely to any other letter). The probability that the correct 10 were picked, and that they changed correctly, is 1/C(1700,10) * 1/3[sup]10[/sup].
#2 : The target string is identical to S0 except that the ten letters have changed correctly, and any number of other mutations may also have occurred.
In the first model I gave :
Each letter has a random chance p of changing to another letter (and equally likely to which letter). The probability that the 10 letters have changed correctly is (p/3)[sup]10[/sup]. (This was the same as your initial guess, with p = 1/1700.)
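The claim that p = 1/1700 reproduces the original (1700*3)^10 guess can be checked with exact fractions. A minimal Python sketch (the variable names are only for this example):

```python
from fractions import Fraction

LENGTH = 1700
p = Fraction(1, LENGTH)  # the per-letter probability implied by the initial guess

model_value = (p / 3) ** 10
initial_guess = Fraction(1, (LENGTH * 3) ** 10)

print(model_value == initial_guess)
print(f"1 in {model_value.denominator:,}")
```

Note that p = 1/1700 corresponds to an average of one mutation per copy of the gene, which is an assumption built into the initial guess rather than a measured rate.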
Trying to incorporate amino acids into this :
We have a string of letters (I’ll stay with AGCT), grouped into triplets (codons, one per amino acid). If the start of the triplets must always be in the correct place, then we can model this as a string of 560 triplets. However, since the changes have to be in particular places, the length of the string doesn’t enter into it (as far as I can tell).
The odds of changing to the correct amino acid are different for each amino acid (multiple triplets for each one, right?).
If you have a particular known string you are changing, you need to figure the probability that a particular coding maps to a particular new triple coding (and then to a particular amino acid).
One way to do this would be a 64 x 64 grid. List the triples down the side and across the top (use the same ordering). The vertical is the current triple, and the horizontal is the new triple. (Example : if you have AAC on the side and ATG above, the value we’ll fill in will be the probability that AAC changes to ATG).
Start with a ‘base’ probability using p for probability of a mutation, and q indicating no change. Then at each point in the grid, the ‘base’ value will be a product of p’s and q’s depending on which letters changed. (Example : AAC->ATG would be qpp, or qp[sup]2[/sup].) Note that there’s a lot of symmetry (about both diagonals) so it should be easy to fill out this basic grid.
To get the probability that a particular triplet changes to a particular amino acid, take the sum of all grid entries for the triplets of that amino acid on the line for the old triplet. If you want to know the probability that 10 particular ones have changed, find the values for each current triplet, and multiply them all together.
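Here is one way the grid could be sketched in Python. Two assumptions go beyond the description above: the codon-to-amino-acid mapping is the standard genetic code (DNA letters, TCAG order), and each changed letter contributes p/3 rather than p, so that a change lands on one specific letter and every row sums to 1. The function names, p = 1e-4, and the two example changes are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons listed in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def codon_change_prob(old: str, new: str, p: float) -> float:
    """Probability that triplet `old` becomes exactly `new`.

    Unchanged letters contribute q = 1 - p; changed letters contribute p / 3.
    """
    prob = 1.0
    for a, b in zip(old, new):
        prob *= (1 - p) if a == b else p / 3
    return prob


def build_grid(p: float) -> dict:
    """64 x 64 grid: grid[old][new] = P(old -> new)."""
    return {old: {new: codon_change_prob(old, new, p) for new in CODONS}
            for old in CODONS}


def codon_to_amino_prob(grid: dict, old: str, amino_acid: str) -> float:
    """Sum the row for `old` over all triplets coding for `amino_acid`."""
    return sum(prob for new, prob in grid[old].items()
               if CODON_TABLE[new] == amino_acid)


def all_changes_prob(grid: dict, changes: list) -> float:
    """Multiply the per-triplet probabilities for every required change."""
    total = 1.0
    for old, amino_acid in changes:
        total *= codon_to_amino_prob(grid, old, amino_acid)
    return total


p = 1e-4                                 # assumed mutation probability per letter
grid = build_grid(p)
changes = [("AAC", "K"), ("GGT", "D")]   # made-up (current triplet, target amino acid)
print(grid["AAC"]["ATG"])                # q * (p/3)^2
print(codon_to_amino_prob(grid, "AAC", "K"))
print(all_changes_prob(grid, changes))
```

For the real problem, `changes` would hold the ten actual triplets and their target amino acids.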
This was a bit of extra work, but the above grid could also be used in another case. Suppose you don’t know the exact string, but you know the proportions of triplets in the string (e.g. maybe it’s the amino acids that are evenly distributed, not the triplets). In that case, multiply each row of the grid by the fraction representing that triplet’s proportion (effectively its probability) in the initial string. Then, take the sum of a column for the new triplet, and sum again for those that make up the amino acid. Once again, do this for the ten new amino acids, and multiply them together.
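The weighted version can be sketched the same way. Assumptions, as labeled in the comments: the standard genetic code, p/3 per changed letter, p = 1e-4, and a uniform set of triplet proportions chosen only as an example. `amino_prob_from_frequencies` is a hypothetical name.

```python
from itertools import product

# Standard genetic code, codons listed in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def codon_change_prob(old: str, new: str, p: float) -> float:
    """P(old -> new): q = 1 - p per unchanged letter, p / 3 per changed letter."""
    prob = 1.0
    for a, b in zip(old, new):
        prob *= (1 - p) if a == b else p / 3
    return prob


def amino_prob_from_frequencies(codon_freq: dict, amino_acid: str, p: float) -> float:
    """Weight each row by the triplet's proportion, then sum the columns
    belonging to the triplets that code for `amino_acid`."""
    total = 0.0
    for old, freq in codon_freq.items():
        for new in CODONS:
            if CODON_TABLE[new] == amino_acid:
                total += freq * codon_change_prob(old, new, p)
    return total


p = 1e-4                                         # assumed mutation probability
uniform = {codon: 1 / 64 for codon in CODONS}    # example proportions only
print(amino_prob_from_frequencies(uniform, "K", p))
```

With uniform triplet proportions the result works out to 2/64 for lysine whatever p is, because each column of the grid also sums to 1. Non-uniform proportions would make the answer depend on p.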
This could be simpler, however.
If both the initial strings and the mutations are entirely random, then the mutation rate doesn’t really matter (either way you have a random string). In this case, it’s just a matter of getting those particular amino acids. Each triplet has a 1/64 probability, so sum for a particular amino acid, and then multiply them together again. This may not be entirely correct, as it assumes you could have started with the final string. It also assumes you know which amino acids you want to end up with. (If this is the start of a better model, maybe we can work from it).
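A sketch of this simplified model in Python, assuming the standard genetic code. The five target amino acids are placeholders, not the real hemagglutinin changes, and the function names are made up.

```python
from collections import Counter
from fractions import Fraction

# Standard genetic code as a 64-character string (codons in TCAG order);
# "*" marks a stop codon. Only the counts per amino acid are needed here.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_COUNTS = Counter(AMINO_ACIDS)


def random_codon_prob(amino_acid: str) -> Fraction:
    """Chance that a uniformly random triplet codes for `amino_acid`."""
    return Fraction(CODON_COUNTS[amino_acid], 64)


def all_targets_prob(targets: str) -> Fraction:
    """Multiply the per-position chances for every target amino acid."""
    prob = Fraction(1)
    for amino_acid in targets:
        prob *= random_codon_prob(amino_acid)
    return prob


targets = "LKMSW"  # placeholder targets, one letter per required amino acid
print(random_codon_prob("L"))      # leucine has 6 of the 64 triplets
print(all_targets_prob(targets))
```

Amino acids with many triplets (leucine, serine, arginine) are far easier targets in this model than those with a single triplet (methionine, tryptophan).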
I think panamajack’s model makes sense, and it’s probably the best mathematical model we can provide that fits the OP’s stated assumptions for this problem. However, in the biological world, even a protein with a few of those ten amino acids being “wrong” might still fold correctly, and have an increased ability to infect humans. Once the virus (even a poorly infecting one) makes a jump into humans, all bets are off, because selective pressure consistently throws out bad mutations, and selects those that converge onto the correct (or similar) amino acid sequence.
Harmonious Discord, it seems you are referring to proofreading functions. However, ssRNA viruses like influenza lack a proofreading mechanism: their RNA polymerase has no proofreading activity, and they don’t have the redundancy of a second, complementary template strand. That’s why influenza mutates so quickly and you have to get re-vaccinated each year.
On second thought, perhaps you were simply pointing out that gene replication is not a completely random process? Sorry for misunderstanding your post. However, I don’t think anyone in this thread was making that particular assumption. Wesley Clark’s “1 in 10,000” figure gives the likelihood that a given nucleotide will mutate per replication of the influenza virus. But the OP isn’t really concerned with the mutation rate per se, but rather with the odds that a combination of 10 very specific mutations will occur together, each one happening at random over many viral generations.