How many MB or GB or TB to hold my DNA?

Oh, yes, good call. Basically the 4x IO improvement ntucker was talking about (though I missed it before, he seems to have beaten me by two minutes to my last post, and put forth my argument better, to boot).

It seems this would be correct, up to the assumption that every possible strand of DNA is equally likely. Taking into account the fact that they very much aren’t, and that most human DNA is apparently almost constant, and calculating the Shannon entropy accordingly, the number one gets will be significantly less. But the OP did say “uncompressed nucleotides”, and so the 1.5 number is probably what they want.
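For anyone who wants to see the arithmetic, here's a back-of-the-envelope check. The ~3 billion base figure is the usual rough count for the human genome, used here purely for illustration:

```python
# Rough storage math for a human genome, assuming ~3 billion bases
# (the exact count is an assumption for illustration).
BASES = 3_000_000_000

bits_uncompressed = BASES * 2          # 2 bits are enough for A/C/G/T
bytes_2bit = bits_uncompressed // 8    # packed 4 bases per byte
bytes_ascii = BASES * 1                # one ASCII character per base

print(f"2-bit packed: {bytes_2bit / 1e6:.0f} MB")   # 750 MB
print(f"ASCII:        {bytes_ascii / 1e9:.0f} GB")  # 3 GB
```

That's the factor-of-four difference being argued about: roughly 750 MB packed versus 3 GB as plain text, before any entropy-based compression.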

But we know that there are five bases, since uracil is found in RNA instead of thymine. Even with that, though, ASCII has way too much space.


You know what’d be funny? Storing DNA sequences in UTF-32…

Well, that’s just a matter of using the same four bases encoding, and putting a single bit at the start of each file (or at whatever the appropriate level would be) indicating whether we’re dealing with DNA/thymine or RNA/uracil.
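A minimal sketch of that scheme, with all names made up for illustration: pack four bases per byte using the same four 2-bit codes for both molecules, and spend one header byte (rather than a literal single bit, for simplicity) whose low bit says whether code 3 means thymine or uracil:

```python
# Sketch: 2-bit packing with a one-byte header flag for DNA vs RNA.
# Function and table names are hypothetical, not any real library's API.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}  # T and U share a code

def pack(seq, is_rna=False):
    out = bytearray([1 if is_rna else 0])  # header: 0 = DNA, 1 = RNA
    byte = 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | CODE[base]
        if i % 4 == 3:                     # a full byte holds four bases
            out.append(byte)
            byte = 0
    if len(seq) % 4:                       # flush a partial final byte
        byte <<= 2 * (4 - len(seq) % 4)
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    is_rna = bool(data[0] & 1)
    letters = "ACGU" if is_rna else "ACGT"
    seq = []
    for byte in data[1:]:
        for shift in (6, 4, 2, 0):         # high bits first, four per byte
            seq.append(letters[(byte >> shift) & 3])
    return "".join(seq[:length]), is_rna
```

So `pack("GATTACA")` and `pack("GAUUACA", is_rna=True)` produce byte strings that differ only in the header, and `unpack` restores the right alphabet from the flag.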

Must be an old article. 3 Gigs is nothing today. New machines often have that much RAM, fer Chrissakes.

Open a 3 gig text file in Word and then say that.

I find it hard to believe that they actually store sequences using a full byte to store each base. That is a monumentally stupid approach.

The only possible advantage I can come up with is that it’s a widely compatible format and human readable. But it’s not really either, when you take into account a file size of multiple gigs. That limits your choice of viewers drastically, plus the fact that a human going over a few billion characters is going to produce very little meaningful analysis.

There is ASCII, ANSI, Unicode…why not a universally accepted DNA data format? That should have been the very first thing they set up. Reading ASCII data is slower than reading binary data. Someone mentioned that mapping tables would have a trivial impact on performance if it were stored in true binary format; I feel compelled to point out that ASCII data is itself a mapping table approach, and one with a ton more overhead than a UniDNA or whatever you’d call it would have.

I don’t get it.

Incidentally, Unicode text is a preferred format over ASCII, since it handles pretty much all international characters, instead of just the sampling that ASCII supports. To handle all those characters, Unicode uses two bytes per character instead of one, making it double the size of the same text in ASCII format.

For those who think the ASCII approach is reasonable, would you feel it equally reasonable to use Unicode instead? If your reaction is “that would be ridiculous”, that’s how the programmers feel about using ASCII to begin with. (Except instead of Unicode doubling the size for no apparent reason, using ASCII quadruples the size for no apparent reason.)

As I understand it, the variation that causes the 3% difference is not in the same place in everyone’s genome. So no.

Well, alright, but you’d still get substantial savings all the same. Think about it this way: what’s likely to be shorter, listing out your entire genome, followed by my entire genome, or listing out your entire genome, followed by an efficient description of the manner in which my genome differs from yours? If our genomes are 97% the same, then the second will be more efficient, by a large shot (just say “Oh, over here we differ like this, and over there we differ like this, etc.” for those portions where needed). So, as Rysto was saying, once you pick something to be “normal” DNA, you can store whole genomes efficiently by just storing efficient descriptions of their deviation from this. In general, the more predictable data is, the less information it contains, and thus the more savings possible by efficient encodings. If people’s DNA generally overlaps about 97%, whether or not in the same areas all the time, then it’s very predictable.
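The idea above can be sketched in a few lines. This toy version assumes substitutions only (real genomes also have insertions and deletions, which a real format would need to handle), and stores a genome as (position, base) pairs where it differs from a chosen reference:

```python
# Sketch of delta encoding against a reference sequence.
# Assumes substitutions only; sequences are toy-sized for illustration.
def diff(reference, genome):
    """List (position, base) pairs where genome differs from reference."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, genome)) if a != b]

def apply_diff(reference, deltas):
    seq = list(reference)
    for i, b in deltas:
        seq[i] = b
    return "".join(seq)

ref = "ACGTACGTACGT"
mine = "ACGTACCTACGA"
deltas = diff(ref, mine)   # only the two differing positions are stored
assert apply_diff(ref, deltas) == mine
```

If two genomes really overlap ~97%, the delta list is a small fraction of the size of a second full copy, which is exactly the savings being described.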

How is this relevant, when we are talking about storing only the letters A, C, T, and G?

This is not true. You are confusing the character set with the encoding.

Fair enough, but I think her points will still stand if you pretend she had explicitly said “UTF-16” or something like that.

I would go for the 2-bit encoding of the four bases, not so much to save space as because files over 2 GB in length are a bit trickier to handle

  • even MSDOS supports a 32-bit seek and a 32-bit file size

FWIW, Prospect magazine gave away a CD with a compressed version of the draft sequence as a freebie stuck to the cover of their October 2000 issue. Pretty much just to make the point that they could.

French people’s DNA uses the C with the little tail on it.

And Germans have an Umlaut over the A (to maintain linguistic compatibility: A as in cave rather than A in cat) :slight_smile:

And Russians can’t interbreed with Europeans because their DNA is in Cyrillic

I think this thread is going to go downhill …