How many MB or GB or TB to hold my DNA?

Oh, yes, good call. Basically the 4x IO improvement ntucker was talking about (though I missed it before, he seems to have beaten me by two minutes to my last post, and put forth my argument better, to boot).

It seems this would be correct, up to the assumption that every possible strand of DNA is equally likely. Taking into account the fact that they very much aren’t, and that most human DNA is apparently almost constant, and calculating the Shannon entropy accordingly, the number one gets will be significantly less. But the OP did say “uncompressed nucleotides”, and so the 1.5 number is probably what they want.
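For anyone who wants to see the arithmetic, here's a back-of-the-envelope check. The ~3 billion base figure is the usual rough count for the human genome, used here purely for illustration:

```python
# Rough storage math for a human genome, assuming ~3 billion bases
# (the exact count is an assumption for illustration).
BASES = 3_000_000_000

bits_uncompressed = BASES * 2          # 2 bits are enough for A/C/G/T
bytes_2bit = bits_uncompressed // 8    # packed 4 bases per byte
bytes_ascii = BASES * 1                # one ASCII character per base

print(f"2-bit packed: {bytes_2bit / 1e6:.0f} MB")   # 750 MB
print(f"ASCII:        {bytes_ascii / 1e9:.0f} GB")  # 3 GB
```

That's the factor-of-four difference being argued about: roughly 750 MB packed versus 3 GB as plain text, before any entropy-based compression.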

But we know that there are five bases, since uracil is found in RNA instead of thymine. Even with that, though, ASCII has way too much space.


You know what’d be funny? Storing DNA sequences in UTF-32…

Well, that’s just a matter of using the same four bases encoding, and putting a single bit at the start of each file (or at whatever the appropriate level would be) indicating whether we’re dealing with DNA/thymine or RNA/uracil.
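A minimal sketch of that scheme, with all names made up for illustration: pack four bases per byte using the same four 2-bit codes for both molecules, and spend one header byte (rather than a literal single bit, for simplicity) whose low bit says whether code 3 means thymine or uracil:

```python
# Sketch: 2-bit packing with a one-byte header flag for DNA vs RNA.
# Function and table names are hypothetical, not any real library's API.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}  # T and U share a code

def pack(seq, is_rna=False):
    out = bytearray([1 if is_rna else 0])  # header: 0 = DNA, 1 = RNA
    byte = 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | CODE[base]
        if i % 4 == 3:                     # a full byte holds four bases
            out.append(byte)
            byte = 0
    if len(seq) % 4:                       # flush a partial final byte
        byte <<= 2 * (4 - len(seq) % 4)
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    is_rna = bool(data[0] & 1)
    letters = "ACGU" if is_rna else "ACGT"
    seq = []
    for byte in data[1:]:
        for shift in (6, 4, 2, 0):         # high bits first, four per byte
            seq.append(letters[(byte >> shift) & 3])
    return "".join(seq[:length]), is_rna
```

So `pack("GATTACA")` and `pack("GAUUACA", is_rna=True)` produce byte strings that differ only in the header, and `unpack` restores the right alphabet from the flag.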

Must be an old article. 3 Gigs is nothing today. New machines often have that much RAM, fer Chrissakes.

Open a 3 gig text file in Word and then say that.

I find it hard to believe that they actually store sequences using a full byte to store each base. That is a monumentally stupid approach.

The only possible advantage I can come up with is that it’s a widely compatible format and human readable. But it’s not really either, when you take into account a file size of multiple gigs. That limits your choice of viewers drastically, plus the fact that a human going over a few billion characters is going to produce very little meaningful analysis.

There is ASCII, ANSI, Unicode…why not a universally accepted DNA data format? That should have been the very first thing they set up. Reading ASCII data is slower than reading binary data. Someone mentioned that mapping tables would have a trivial impact on performance if it were stored in true binary format; I feel compelled to point out that ASCII data is itself a mapping table approach, and one with a ton more overhead than a UniDNA or whatever you’d call it would have.

I don’t get it.

Incidentally, Unicode text is a preferred format over ASCII, since it handles pretty much all international characters, instead of just the sampling that ASCII supports. To handle all those characters, Unicode uses two bytes per character instead of one, making it double the size of the same text in ASCII format.

For those who think the ASCII approach is reasonable, would you feel it equally reasonable to use Unicode instead? If your reaction is “that would be ridiculous”, that’s how the programmers feel about using ASCII to begin with. (Except instead of Unicode doubling the size for no apparent reason, using ASCII quadruples the size for no apparent reason.)

As I understand it, the variation that causes the 3% difference is not in the same place in everyone’s genome. So no.

Well, alright, but you’d still get substantial savings all the same. Think about it this way: what’s likely to be shorter, listing out your entire genome, followed by my entire genome, or listing out your entire genome, followed by an efficient description of the manner in which my genome differs from yours? If our genomes are 97% the same, then the second will be more efficient, by a large shot (just say “Oh, over here we differ like this, and over there we differ like this, etc.” for those portions where needed). So, as Rysto was saying, once you pick something to be “normal” DNA, you can store whole genomes efficiently by just storing efficient descriptions of their deviation from this. In general, the more predictable data is, the less information it contains, and thus the more savings possible by efficient encodings. If people’s DNA generally overlaps about 97%, whether or not in the same areas all the time, then it’s very predictable.
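The idea above can be sketched in a few lines. This toy version assumes substitutions only (real genomes also have insertions and deletions, which a real format would need to handle), and stores a genome as (position, base) pairs where it differs from a chosen reference:

```python
# Sketch of delta encoding against a reference sequence.
# Assumes substitutions only; sequences are toy-sized for illustration.
def diff(reference, genome):
    """List (position, base) pairs where genome differs from reference."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, genome)) if a != b]

def apply_diff(reference, deltas):
    seq = list(reference)
    for i, b in deltas:
        seq[i] = b
    return "".join(seq)

ref = "ACGTACGTACGT"
mine = "ACGTACCTACGA"
deltas = diff(ref, mine)   # only the two differing positions are stored
assert apply_diff(ref, deltas) == mine
```

If two genomes really overlap ~97%, the delta list is a small fraction of the size of a second full copy, which is exactly the savings being described.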

How is this relevant, when we are talking about storing only the letters A, C, T, and G?

This is not true. You are confusing the character set with the encoding.

Fair enough, but I think her points will still stand if you pretend she had explicitly said “UTF-16” or something like that.

I would go for the 2-bit encoding of the four bases, not so much to save space as because files over 2 GB in length are a bit trickier to handle

  • even MSDOS supports a 32-bit seek and a 32-bit file size

FWIW, Prospect magazine gave away a CD with a compressed version of the draft sequence as a freebie stuck to the cover of their October 2000 issue. Pretty much just to make the point that they could.

French people’s DNA uses the C with the little tail on it.

And Germans have an Umlaut over the A (to maintain linguistic compatibility: A as in cave rather than A in cat) :slight_smile:

And Russians can’t interbreed with Europeans because their DNA is in Cyrillic

I think this thread is going to go downhill …