I was watching Jurassic Park the other day. When the animated Mr. DNA dude came on explaining DNA, he said something like “If we looked at screens like these every second for 5 years, we’d see the whole chain.” Now I am wondering how many bytes it would take for me to store my DNA on a computer. This would be uncompressed nucleotides, just the letters AGCT.
The online DNA Calculator wasn’t quite what I expected it to be.
About 3 GB to hold it all:
Q. How big is the human genome?
The human genome is made up of DNA, which has four different chemical building blocks. These are called bases and abbreviated A, T, C, and G. In the human genome, about 3 billion bases are arranged along the chromosomes in a particular order for each unique individual. To get an idea of the size of the human genome present in each of our cells, consider the following analogy: If the DNA sequence of the human genome were compiled in books, the equivalent of 200 volumes the size of a Manhattan telephone book (at 1000 pages each) would be needed to hold it all.
It would take about 9.5 years to read out loud (without stopping) the 3 billion bases in a person’s genome sequence. This is calculated on a reading rate of 10 bases per second, equaling 600 bases/minute, 36,000 bases/hour, 864,000 bases/day, 315,360,000 bases/year.
Storing all this information is a great challenge to computer experts known as bioinformatics specialists. One million bases (called a megabase and abbreviated Mb) of DNA sequence data is roughly equivalent to 1 megabyte of computer data storage space. Since the human genome is 3 billion base pairs long, 3 gigabytes of computer data storage space are needed to store the entire genome. This includes nucleotide sequence data only and does not include data annotations and other information that can be associated with sequence data.
As time goes on, more annotations will be entered as a result of laboratory findings, literature searches, data analyses, personal communications, automated data-analysis programs, and auto annotators. These annotations associated with the sequence data will likely dwarf the amount of storage space actually taken up by the initial 3 billion nucleotide sequence. Of course, that’s not much of a surprise because the sequence is merely one starting point for much deeper biological understanding!
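For what it’s worth, the arithmetic in that answer checks out. Here’s a quick sanity check in Python (nothing genome-specific, just the quoted numbers):

```python
# Sanity check of the quoted figures: 3 billion bases, read at 10 per second,
# stored as one ASCII byte per base.
BASES = 3_000_000_000        # haploid human genome, roughly
READ_RATE = 10               # bases read aloud per second, per the FAQ

seconds = BASES / READ_RATE
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.1f} years of nonstop reading")       # ~9.5 years

ascii_bytes = BASES * 1      # one byte per A/C/G/T character
print(f"{ascii_bytes / 1e9:.0f} GB as plain ASCII")  # ~3 GB
```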
3 gigabases is the size of your haploid (n) genome. Since you have two possibly different versions of each gene, you need the full diploid genome (2n) to describe yourself. So that’s 6 gigabases, or 6GB.
I must be misunderstanding something. Seems like each base takes two bits to encode; ergo, one million bases should take two million bits to encode, which is only about a quarter of a megabyte.
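To spell out that arithmetic (Python again, purely as a back-of-the-envelope check):

```python
# Two bits per base, packed four bases to the byte.
BITS_PER_BASE = 2

megabase = 1_000_000
print(megabase * BITS_PER_BASE / 8 / 1e6)     # 0.25 MB per megabase

haploid = 3_000_000_000
print(haploid * BITS_PER_BASE / 8 / 1e9)      # 0.75 GB haploid
print(2 * haploid * BITS_PER_BASE / 8 / 1e9)  # 1.5 GB diploid
```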
The thing about biologists is that they make terrible programmers. Much of the Human Genome Project consists of gigantic files full of ASCII A’s, C’s, G’s and T’s. And the laboratory systems are held together with a million rolls of duct tape and several thousand lines of Perl.
I would suspect that there are repeating sections as well, so you could potentially shave off another four-fifths or seven-eighths of the size, depending on how well DNA compresses.
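As a rough illustration of the mechanics only (not of real DNA, which has long repeats that a general-purpose compressor can exploit much further), here is what zlib does with a random run of ASCII bases. Random bases carry about two bits each, so roughly 4:1 is the best you can expect in this toy case:

```python
import random
import zlib

# Compress one megabase of random ASCII bases with a generic compressor.
# Real genomic sequence is far from random, so real-world ratios differ.
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(1_000_000)).encode("ascii")
packed = zlib.compress(seq, 9)

print(len(seq), "->", len(packed), f"({len(seq) / len(packed):.1f}:1)")
```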
Okay, here’s taking my original OP further: if I send a full sperm sample to get analyzed by some DNA-decoding machine, how much SAN storage would I need to hold just my haploid genome?
Is that necessarily bad programming practice? You can store the bases as text and compress the data later if storage space concerns you. Gigantic files full of ASCII or XML or whatever compress very well. Using the bare minimum number of bits to store something is premature optimisation of the sort that led to Y2K problems. Text-based formats are more flexible.
We’d hardly be talking about squeezing things into an unnatural compressed encoding to use the bare minimum number of bits, though; the most natural way to encode a base is as a two bit sequence. That’s not premature optimization, that’s just straightforward. Using a full ASCII character is blowing things up to four times the necessary size for no reason; the resulting files are 75% wasted space. That’s frivolous anti-optimization.
Y2K problems were caused by picking an encoding that actually was unable to properly distinguish between certain values; that would not be an issue here. Two bits is quite enough to encode A, C, G, and T separately. What advantage is there to storing this data in the form of text? What problems could there possibly be from using the most natural binary encoding instead?
I suspect in the practical, everyday world the advantage is that it can be searched and parsed by a wide variety of very generic textual data handling routines. It’s not as if storage space is terribly expensive anymore.
In a custom system optimised for storing that data, and assuming there will only ever be four bases, yes, two bits per base would be just fine.
Whatever language you’re using to manipulate the data will have native or near-native APIs to read and write ASCII data in a file. Storing it in any other format will involve compressing and decompressing it in addition to reading and writing it, and no matter how fast you make the compression/decompression algorithm, it’s going to take a little while to run it on 3 billion base pairs. Storage space is cheap these days, and DNA analysis is slow enough already, so why not optimize for speed as much as you can?
Well, something like 99.9% of human DNA is the same in all of us, isn’t it? So you could get substantial savings by storing your DNA as diffs against “normal” DNA.
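A toy sketch of that idea (substitutions only; real variant formats also have to handle insertions and deletions, and all names here are made up for illustration):

```python
# Store only the positions where a sample differs from a shared reference.
def diff_against_reference(reference: str, sample: str):
    return [(i, b) for i, (r, b) in enumerate(zip(reference, sample)) if r != b]

def apply_diffs(reference: str, diffs):
    seq = list(reference)
    for i, b in diffs:
        seq[i] = b
    return "".join(seq)

reference = "ACGTACGTAC"
sample    = "ACGTTCGTAG"
diffs = diff_against_reference(reference, sample)
print(diffs)                                    # [(4, 'T'), (9, 'G')]
assert apply_diffs(reference, diffs) == sample  # round-trips back to the sample
```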
You can’t seriously be arguing that this “compression” takes any significant amount of CPU time when compared to the IO. It’s a 1:1 translation that saves you 4.5 gigabytes of IO, for god’s sake. In fact, if you’re really concerned about the translation perf (which you should not be), your read code gets really fast: you read 1 byte and do a table lookup to get 4 characters. Holy cow! So on the read side, you get about a 4x IO improvement at a processing cost of … I’m going with zero, whereas on the write side you get about a 4x IO improvement with a .0000001% (WAG) processing cost. Hmmm. Yes, I’m agreeing with “bad programmer” for someone who stores the data as ASCII and wonders why his DNA analysis takes a long time.
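Something like this, roughly (Python just to sketch it; I’m assuming the first base sits in the high bits of each byte):

```python
# Build a 256-entry table once; each packed byte then maps straight to four bases.
BASES = "ACGT"
DECODE = [
    BASES[(b >> 6) & 3] + BASES[(b >> 4) & 3] + BASES[(b >> 2) & 3] + BASES[b & 3]
    for b in range(256)
]

def unpack(data: bytes) -> str:
    """Expand packed bytes back into an ASCII string of bases."""
    return "".join(DECODE[b] for b in data)

print(unpack(bytes([0b00011011])))   # ACGT
```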
Further, the idea that you have to do any translation at all is kind of silly if your goal is optimizing for speed. Manipulating the data in its natural, more efficient format is going to involve touching 75% fewer bytes in memory, which will be faster. Anyone who’s done graphics optimization could tell you this: less data touched == faster.
[note: one scenario I think is probably fairly likely is that they don’t actually store the data as ASCII on disk, and that was an oversimplification somewhere along the chain of turning facts into PR]
But this is a tangent. The answer to the question is that there are about 1.5 gigabytes of actual information in your DNA, as far as I can tell. Correct?
But I wasn’t talking about compression like LZW or Huffman coding or anything. Just simple, straightforward fixed character encodings, the natural ones. Instead of the 8-bit encoding 01000001 for A, use 00. Instead of 01000111 for G, use 10. That’s all. Read in each byte as four bases, write four bases to each byte. The speed hit… that’s not a speed hit, that’s nothing. Storage is apparently way too cheap to worry about such things, so, alright, whatever, but it still strikes me as odd to have a 6 GB file, 4.5 GB of which is pure junk (basically on the level of the first three bytes out of every four being guaranteed to be 0, yet writing them down all the same), when all one would have to do is locate one’s language’s binary I/O libraries instead of its text I/O libraries to do much better. But if storage really is all that cheap, I guess it doesn’t really matter.
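For concreteness, here is a minimal sketch of the write side of that encoding (A=00 and G=10 as above; I’m filling in C=01 and T=11 alphabetically, and putting the first base in the high bits to match the read side):

```python
# Pack an ASCII base string into two bits per base, four bases per byte.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, "A")   # pad the tail; real code would also record the length
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]  # first base ends up in the high bits
        out.append(byte)
    return bytes(out)

print(pack("ACGT").hex())   # 1b == 0b00011011, one byte for four bases
```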