We know how long a DNA strand is, right? How many pairs it has? I've heard that the strands themselves are quite long. Each pair on a strand is binary, right, since it could be one of two things, AT or GC, so how much information in bits could we fit on a DNA strand?
No, a pair can be one of four. AT is not the same as TA. Only one strand is read by the polymerase during transcription, so you can ignore the other “complementary” nucleotide. So the meaningful nucleotide can be C, T, G, or A.
DNA strands vary wildly in length. There are 23 different human chromosomes, for example, which range in length from 51 million to 245 million base pairs. (Taken from here.) There are DNA-based viruses that have DNA strands that are as short as 3200 base pairs. We can make relatively short strands, but right now, my understanding is that we can’t easily synthesize something as long as a human chromosome.
It’s not really binary. Even though A pairs with T, and G with C, DNA reading AGGCTC (or TCCGAG on its complementary strand) says something totally different from DNA reading TGCGTC. (The first one says Arginine Leucine, and the second one says Cysteine Valine, for what it’s worth.) If A and T were both 0, and G and C both 1 (in your proposed binary coding scheme), the two little DNA codes would say the same thing.
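A quick sketch in Python to make that concrete (the bit assignments are arbitrary, just for illustration):

```python
# One-bit scheme from the question: A and T -> 0, G and C -> 1.
ONE_BIT = {"A": "0", "T": "0", "G": "1", "C": "1"}

# Two-bit scheme: all four nucleotides stay distinct.
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(seq, table):
    return "".join(table[base] for base in seq)

print(encode("AGGCTC", ONE_BIT))  # 011101
print(encode("TGCGTC", ONE_BIT))  # 011101 -- collision, information lost
print(encode("AGGCTC", TWO_BIT))  # 001010011101
print(encode("TGCGTC", TWO_BIT))  # 111001101101 -- distinct
```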
The naive encoding requires two bits per letter, but you can do slightly better with more sophisticated schemes. This paper seems to suggest that on average you need about 1.7 bits per nucleotide, but there’s significant variation from sequence to sequence.
You can use 5 bits to represent each of the 20 amino acids plus a start and stop. That replaces 3 base pairs, so that’s 1.667 bits per pair. But that’s getting away from the OP, I guess.
Right. There are also sections that are pretty clearly junk (like a damaged gene that has been replaced by another copy somewhere else), sections that don’t code for anything, but seem to be important for the shape and structure of the chromosome, and sections that really don’t seem to do anything at all (but maybe they do and we just don’t know what). It’s kind of like a computer hard drive: there’s a lot of ‘data’ data, plus a bunch of header information for each file, information about the disk directory and file structure, unused space at the end of files that isn’t data but can’t be used by any other file, space that isn’t used but is marked as allocated because of an error in the file system, backup file data, a permanently allocated swap file, etc. You can decide some part of that isn’t real ‘data’ and not count it, but where you draw the line is arbitrary; the only objective number is the total disk space minus the free space for files. For DNA, the only real hard number we can give is 2 bits per base pair.
Except in a real chromosome, there is more information than just the base pairs: DNA can be modified with, for example, methylation, which changes the chemical structure of base pairs, as well as other various modifications that may be chemical changes to the DNA molecule or may be changes to the supporting structures and physical geometry of the DNA. It’s difficult to impossible to put hard numbers on this. One could draw the line at chemical changes to the DNA molecule itself and add one bit per base pair for methylated/non-methylated, I suppose, though I’m sure that’s overestimating the actual information content.
I imagine that you could get a reasonable estimate by taking a database of a complete human genome and running it through any of the standard compression programs.
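Something like this would give a rough upper bound, though general-purpose compressors aren’t tuned for DNA (the filename is hypothetical, and the file is assumed to be plain A/C/G/T text, one character per nucleotide):

```python
import gzip

# Hypothetical input: a plain-text file containing only A/C/G/T.
with open("genome.txt", "rb") as f:
    sequence = f.read()

compressed = gzip.compress(sequence, compresslevel=9)

# One input byte per nucleotide, so compressed bits per input byte
# gives an upper-bound estimate of bits per base.
bits_per_base = 8 * len(compressed) / len(sequence)
print(f"~{bits_per_base:.2f} bits per base (upper bound)")
```

In practice a general-purpose compressor tends to land at or a bit above 2 bits per base on real genomes, partly because its small match window can’t exploit long-range repeats; that’s part of why the DNA-specific algorithms in the paper linked above get closer to 1.7.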
And while different nucleotide sequences can code for the same amino acid, I would still recommend counting them as distinct for information-counting purposes. There are at least some cases known of overlapping genes in different reading frames, and even if two codons code for the same amino acid, they don’t have the same set of things they can easily mutate into.
While you could compress things down to just represent the 22 end states (20 amino acids and stop and start), that’s a lossy compression since you can’t get back the exact nucleotide sequence.
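To see why it’s lossy, here are a few rows of the standard codon table (the table values are textbook biology; the snippet just demonstrates the many-to-one mapping):

```python
# A few rows of the standard codon table.
CODON_TO_AA = {
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "GCT": "Ala", "GCC": "Ala",
}

print(CODON_TO_AA["CTC"], CODON_TO_AA["GCC"])  # Leu Ala

# The mapping is many-to-one, so it can't be inverted: "Leu Ala" could
# have come from any of eight combinations just among the rows shown
# (and more in the full table).
```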
The vast majority of our DNA (something on the order of 98%) is what used to be referred to as ‘junk DNA,’ i.e., it doesn’t code for proteins. We’ve realized more recently that (as mr. jp referred to above) a lot of the noncoding regions are doing other stuff - they regulate the expression of coding regions, code for RNAs that do specific jobs aside from being the messengers for coding DNA, etc. Still, the majority of DNA really does seem to be useless, as far as we can tell right now - old viruses that incorporated themselves in and made many copies, nonfunctional genes that used to code for something (pseudogenes), stuff that jumps around in the genome inserting more and more copies of itself, etc. You can get a ton of randomness in all of that stuff, because there’s no selective pressure forcing it to make any sense.
This reminds me of one of the more interesting threads here on the Dope, wherein it was attempted to calculate the bandwidth of the average ejaculation.
See, that’s the part that surprises me. Something that ends up making a whole bunch of copies of itself isn’t random. There’s no selective pressure pushing it to make sense, but it should start off with a certain sort of sense to begin with. It’s easy to compress something consisting of a bunch of copies of something relatively short.
If you look at the paper I linked to, there are algorithms that will compress genetic information, but they’re not standard. It’s also worth keeping in mind that randomness isn’t tied to how well any particular algorithm can compress a sequence; it’s tied to how well the best algorithm for that sequence compresses it.
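You can see both effects with a quick experiment, comparing a perfectly repetitive sequence against a uniformly random one (a sketch, using gzip as a stand-in for “some particular algorithm”):

```python
import gzip
import random

random.seed(0)

# Many identical copies of a short element, like a recent insertion.
repetitive = "ACGTTGCA" * 10_000

# No selective pressure at all: uniformly random nucleotides.
random_seq = "".join(random.choice("ACGT") for _ in range(80_000))

for label, seq in [("repetitive", repetitive), ("random", random_seq)]:
    bits = 8 * len(gzip.compress(seq.encode())) / len(seq)
    print(f"{label}: {bits:.3f} bits per base")
```

The repetitive sequence compresses to a tiny fraction of a bit per base, while the random one stays near (or above) the 2-bit ceiling, no matter which off-the-shelf algorithm you pick.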
That’s true, but there’s no selection pressure to prevent these repetitive elements from mutating, so they accumulate changes as fast as any other useless bit of the genome. There’s enough similarity to identify the sequences, but that doesn’t have to be a lot. The “junk” copied elements have been accumulating for billions of years. The oldest have probably mutated beyond recognition, and only the most recent insertions will actually represent functional, identical copies.
If only considering genes:
about 30% of genes are known to undergo intron/exon rearrangement (alternative splicing), where different bits of the mRNA are cut out after transcription to make new combinations for translation into polypeptide chains.
Therefore, one portion of DNA sequence can contain more information than just the sequence along its length.
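A toy illustration of how splicing multiplies the message count from one stretch of DNA (the exon names are made up, and real splicing is far more constrained than this):

```python
from itertools import combinations

# Hypothetical gene with four exons; splicing keeps the first and last
# exon and may keep or drop each middle one.
exons = ["E1", "E2", "E3", "E4"]
first, *middle, last = exons

transcripts = []
for r in range(len(middle) + 1):
    for kept in combinations(middle, r):
        transcripts.append("-".join([first, *kept, last]))

print(transcripts)
# ['E1-E4', 'E1-E2-E4', 'E1-E3-E4', 'E1-E2-E3-E4'] -- four distinct
# mRNAs (and potentially four proteins) from one stretch of DNA.
```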
A lot of the DNA molecule is “junk,” as mentioned above, but we do use information from the junk (short tandem repeats, minisatellites) in DNA profiling and fingerprinting. Using gene sequences to differentiate would be pretty useless, as our genes are extremely similar between individuals.
So basically, what do you count as information, and what doesn’t?
The genome is the ultimate in spaghetti code. As pointed out above, even “useless” repetitions and mutations will have effects on the shape of the chromosome and relative positioning of other genes and can affect the phenotype. As we continue to study genetics I believe we will find that many of the differences between individuals will be traceable to differences in the “junk” DNA.