We know how long a DNA strand is, right? How many pairs it has? I've heard that the strands themselves are quite long. Each pair on a strand is binary, right, since it could be one of two things, AT or GC, so how much information in bits could we fit on a DNA strand?
No, a pair can be one of four. AT is not the same as TA. Only one strand is read by the polymerase during transcription, so you can ignore the other “complementary” nucleotide. So the meaningful nucleotide can be C, T, G, or A.
DNA strands vary wildly in length. There are 23 different human chromosomes, for example, which range in length from 51 million to 245 million base pairs. (Taken from here.) There are DNA-based viruses that have DNA strands that are as short as 3200 base pairs. We can make relatively short strands, but right now, my understanding is that we can’t easily synthesize something as long as a human chromosome.
It’s not really binary. Even though A pairs with T, and G with C, DNA reading AGGCTC (or TCCGAG on its complementary strand) says something totally different from DNA reading TGCGTC. (The first one says Arginine Leucine, and the second one says Cysteine Valine, for what it’s worth.) If A and T were both 0, and G and C both 1 (in your proposed binary coding scheme), the two little DNA codes would say the same thing.
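A quick sketch in Python to make that concrete (the bit assignments are arbitrary, just for illustration):

```python
# One-bit scheme from the question: A and T -> 0, G and C -> 1.
ONE_BIT = {"A": "0", "T": "0", "G": "1", "C": "1"}

# Two-bit scheme: all four nucleotides stay distinct.
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(seq, table):
    return "".join(table[base] for base in seq)

print(encode("AGGCTC", ONE_BIT))  # 011101
print(encode("TGCGTC", ONE_BIT))  # 011101 -- collision, information lost
print(encode("AGGCTC", TWO_BIT))  # 001010011101
print(encode("TGCGTC", TWO_BIT))  # 111001101101 -- distinct
```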
The naive encoding requires two bits per letter, but you can do slightly better with more sophisticated schemes. This paper seems to suggest that on average you need about 1.7 bits per nucleotide, but there’s significant variation from sequence to sequence.
You can use 5 bits to represent each of the 20 amino acids plus a start and stop. That replaces 3 base pairs, so that’s 1.667 bits per pair. But that’s getting away from the OP, I guess.
Right. There are also sections that are pretty clearly junk (like a damaged gene that has been replaced by another copy somewhere else), sections that don’t code for anything, but seem to be important for the shape and structure of the chromosome, and sections that really don’t seem to do anything at all (but maybe they do and we just don’t know what). It’s kind of like a computer hard drive: there’s a lot of ‘data’ data, plus a bunch of header information for each file, information about the disk directory and file structure, unused space at the end of files that isn’t data but can’t be used by any other file, space that isn’t used but is marked as allocated because of an error in the file system, backup file data, a permanently allocated swap file, etc. You can decide some part of that isn’t real ‘data’ and not count it, but where you draw the line is arbitrary; the only objective number is the total disk space minus the free space for files. For DNA, the only real hard number we can give is 2 bits per base pair.
Except in a real chromosome, there is more information than just the base pairs: DNA can be modified with, for example, methylation, which changes the chemical structure of base pairs, as well as other various modifications that may be chemical changes to the DNA molecule or may be changes to the supporting structures and physical geometry of the DNA. It’s difficult to impossible to put hard numbers on this. One could draw the line at chemical changes to the DNA molecule itself and add one bit per base pair for methylated/non-methylated, I suppose, though I’m sure that’s overestimating the actual information content.
I imagine that you could get a reasonable estimate by taking a database of a complete human genome and running it through any of the standard compression programs.
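Something like this would give a rough upper bound, though general-purpose compressors aren’t tuned for DNA (the filename is hypothetical, and the file is assumed to be plain A/C/G/T text, one character per nucleotide):

```python
import gzip

# Hypothetical input: a plain-text file containing only A/C/G/T.
with open("genome.txt", "rb") as f:
    sequence = f.read()

compressed = gzip.compress(sequence, compresslevel=9)

# One input byte per nucleotide, so compressed bits per input byte
# gives an upper-bound estimate of bits per base.
bits_per_base = 8 * len(compressed) / len(sequence)
print(f"~{bits_per_base:.2f} bits per base (upper bound)")
```

In practice a general-purpose compressor tends to land at or a bit above 2 bits per base on real genomes, partly because its small match window can’t exploit long-range repeats; that’s part of why the DNA-specific algorithms in the paper linked above get closer to 1.7.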
And while different nucleotide sequences can code for the same amino acid, I would still recommend counting them as distinct for information-counting purposes. There are at least some cases known of overlapping genes in different reading frames, and even if two codons code for the same amino acid, they don’t have the same set of things they can easily mutate into.
While you could compress things down to just represent the 22 end states (20 amino acids and stop and start), that’s a lossy compression since you can’t get back the exact nucleotide sequence.
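To see why it’s lossy, here are a few rows of the standard codon table (the table values are textbook biology; the snippet just demonstrates the many-to-one mapping):

```python
# A few rows of the standard codon table.
CODON_TO_AA = {
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "GCT": "Ala", "GCC": "Ala",
}

print(CODON_TO_AA["CTC"], CODON_TO_AA["GCC"])  # Leu Ala

# The mapping is many-to-one, so it can't be inverted: "Leu Ala" could
# have come from any of eight combinations just among the rows shown
# (and more in the full table).
```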
The vast majority of our DNA (something on the order of 98%) is what used to be referred to as ‘junk DNA,’ i.e., it doesn’t code for proteins. We’ve realized more recently that (as mr. jp referred to above) a lot of the noncoding regions are doing other stuff - they regulate the expression of coding regions, code for RNAs that do specific jobs aside from being the messengers for coding DNA, etc. Still, the majority of DNA really does seem to be useless, as far as we can tell right now - old viruses that incorporated themselves in and made many copies, nonfunctional genes that used to code for something (pseudogenes), stuff that jumps around in the genome inserting more and more copies of itself, etc. You can get a ton of randomness in all of that stuff, because there’s no selective pressure forcing it to make any sense.
This reminds me of one of the more interesting threads here on the Dope, wherein it was attempted to calculate the bandwidth of the average ejaculation.
See, that’s the part that surprises me. Something that ends up making a whole bunch of copies of itself isn’t random. There’s no selective pressure pushing it to make sense, but it should start off with a certain sort of sense to begin with. It’s easy to compress something consisting of a bunch of copies of something relatively short.
If you look at the paper I linked to, there are algorithms that will compress genetic information, but they’re not standard. It’s also worth keeping in mind that randomness isn’t tied to how well any particular algorithm can compress a sequence; it’s tied to how well the best algorithm for that sequence compresses it.
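You can see both effects with a quick experiment, comparing a perfectly repetitive sequence against a uniformly random one (a sketch, using gzip as a stand-in for “some particular algorithm”):

```python
import gzip
import random

random.seed(0)

# Many identical copies of a short element, like a recent insertion.
repetitive = "ACGTTGCA" * 10_000

# No selective pressure at all: uniformly random nucleotides.
random_seq = "".join(random.choice("ACGT") for _ in range(80_000))

for label, seq in [("repetitive", repetitive), ("random", random_seq)]:
    bits = 8 * len(gzip.compress(seq.encode())) / len(seq)
    print(f"{label}: {bits:.3f} bits per base")
```

The repetitive sequence compresses to a tiny fraction of a bit per base, while the random one stays near (or above) the 2-bit ceiling, no matter which off-the-shelf algorithm you pick.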
That’s true, but there’s no selection pressure to prevent these repetitive elements from mutating, so they accumulate changes as fast as any other useless bit of the genome. There’s enough similarity to identify the sequences, but that doesn’t have to be a lot. The “junk” copied elements have been accumulating for billions of years. The oldest have probably mutated beyond recognition, and only the most recent insertions will actually represent functional, identical copies.
If only considering genes:
about 30% of genes are known to undergo intron/exon rearrangement (alternative splicing), where different bits of the mRNA are cut out after transcription to make new combinations for translation into polypeptide chains.
Therefore, one portion of DNA sequence can contain more information than just the sequence along its length.
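A toy illustration of how splicing multiplies the message count from one stretch of DNA (the exon names are made up, and real splicing is far more constrained than this):

```python
from itertools import combinations

# Hypothetical gene with four exons; splicing keeps the first and last
# exon and may keep or drop each middle one.
exons = ["E1", "E2", "E3", "E4"]
first, *middle, last = exons

transcripts = []
for r in range(len(middle) + 1):
    for kept in combinations(middle, r):
        transcripts.append("-".join([first, *kept, last]))

print(transcripts)
# ['E1-E4', 'E1-E2-E4', 'E1-E3-E4', 'E1-E2-E3-E4'] -- four distinct
# mRNAs (and potentially four proteins) from one stretch of DNA.
```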
A lot of the DNA molecule is “junk,” as mentioned above, but we do use information from the junk (short tandem repeats, minisatellites) in DNA profiling and fingerprinting. Using gene sequences to differentiate would be pretty useless, as our genes are extremely similar between individuals.
So basically, what do you count as information, and what doesn’t?
The genome is the ultimate in spaghetti code. As pointed out above, even “useless” repetitions and mutations will have effects on the shape of the chromosome and relative positioning of other genes and can affect the phenotype. As we continue to study genetics I believe we will find that many of the differences between individuals will be traceable to differences in the “junk” DNA.