Read my DNA

I’ve got a whack of ignorance that needs fightin’… Genetics. My ignorance is preventing me from appreciating the reports of genetic research projects, esp. genome projects:

Here’s what I think I know - please tell me where I’m wrong…

DNA is where information is stored. This information is encoded into DNA using the four “letters of the genetic alphabet” G-T-A-C (these letters representing names I forget, ending in -ine I think). Pairs of these letters form rungs on a twisted ladder - that ladder being a DNA molecule.

The ladder is “understood” by grouping the rungs. Each sequence of rungs (base-pairs) forms a word? paragraph? called a gene. So one DNA molecule contains many genes. Each gene describes a particular aspect of the species, i.e. this gene describes hair colour, that gene describes liver size, etc.

The DNA molecules are housed in a structure called a chromosome, being an organized collection of DNA molecules along with associated proteins (whatever they’re for - maybe the glue that holds it all together). Each species has a set of chromosome pairs (duplicated for redundancy?). Humans have 23 pairs, dogs have 39 pairs, horses 32, cows 30, etc.

I’m not exactly sure what a genome is, but I believe it’s an encompassing term that applies to all of the above on a species-by-species basis, i.e. the human genome is such that 23 pairs of chromosomes are organized in this way, having this gene which does this at this location, and that gene which does that at that location, etc.

With the cite at this site in sight, what does it mean, exactly, that a chimpanzee’s DNA differs from human DNA by 1.2% - 2.7%? The article mentions base-pair comparisons vs. large sequence comparisons. I don’t grok it. Does base-pair analysis mean they lined up all of the rungs in the ladder of each species side by side like this, then counted the differences?



      123456789ABCDEF
      ---------------
Human GGATTCAATGAGGCT
      ATCGATTTAGCAGGC
            |    |
Chimp GGATTCGATGATGCT
      ATCGATCTAGCCGGC


Assume for illustration that the entire sequence for each species is exactly 15 base-pairs. The two differences shown above (rung 7 and rung C) amount to a 2/15 = 13% difference. Is this, essentially, the type of analysis that gives us the 1.2%?
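
For the programmers reading along, the rung-by-rung count described above can be sketched in a few lines of Python. This is only the toy 15-base illustration from the diagram (top strands only), not the method used in any real genome comparison; the function name `mismatches` is made up for this example.

```python
# Top strands from the toy diagram above (illustrative data, not real genomes).
human = "GGATTCAATGAGGCT"
chimp = "GGATTCGATGATGCT"


def mismatches(a, b):
    """Return the 1-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


diff_positions = mismatches(human, chimp)
percent_difference = 100 * len(diff_positions) / len(human)

print(diff_positions)                # [7, 12]  (rung 7 and rung C)
print(round(percent_difference, 1))  # 13.3
```

Position 12 is the rung labelled C in the diagram, so the count matches the 2/15 figure.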

Let’s say that base-pairs are grouped into the following genes (this makes sense only if what I think I know is correct):
For the human:
1-3 - height (3 pairs)
4-6 - ability to blush[sup]1[/sup] (3 pairs)
7-9 - eye colour (3 pairs)
A-F - everything else (6 pairs)

And for the chimp
1-2 - banana detection (2 pairs)
3-4 - height (2 pairs)
5-9 - eye colour (5 pairs)
A-F - everything else (6 pairs)

Here the genes governing common traits represent 12 and 13 base-pairs respectively. The 2 to 3 base-pairs left over map to unique traits, hence an approximate 20% difference (let’s say).
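
The bookkeeping in this second toy analysis can be checked with a short Python sketch. The trait names and ranges are the made-up ones from the lists above (rungs A-F written as 10-15); `span` and `shared_and_unique` are hypothetical helper names, and nothing here reflects how real genes are laid out.

```python
# Toy gene maps from the lists above: trait -> (first rung, last rung), inclusive.
human_genes = {
    "height": (1, 3),
    "ability to blush": (4, 6),
    "eye colour": (7, 9),
    "everything else": (10, 15),
}
chimp_genes = {
    "banana detection": (1, 2),
    "height": (3, 4),
    "eye colour": (5, 9),
    "everything else": (10, 15),
}


def span(rung_range):
    """Number of base-pairs covered by an inclusive (first, last) range."""
    first, last = rung_range
    return last - first + 1


def shared_and_unique(genes, other):
    """Base-pairs in traits both species have, and in traits only this one has."""
    shared = sum(span(r) for trait, r in genes.items() if trait in other)
    unique = sum(span(r) for trait, r in genes.items() if trait not in other)
    return shared, unique


print(shared_and_unique(human_genes, chimp_genes))  # (12, 3)
print(shared_and_unique(chimp_genes, human_genes))  # (13, 2)
```

So 3 of the human's 15 base-pairs (20%) and 2 of the chimp's 15 (about 13%) map to unique traits in this illustration.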

These examples do not parse, because in the second analysis such trait-mapping disparities would necessarily result in a higher base-pair disparity. Obviously there’s much I don’t understand.

I have other questions, but I’ll stop here so I can benefit from the corrections you will surely provide to my understanding…


[sup]1[/sup] “Man is the only animal that blushes - or needs to.” - Mark Twain

This part is as wrong as wrong can be. Genes, for the most part, code for proteins. At one time, “one gene, one protein” was a guiding principle of genetics – we now know it’s not strictly true, but it’s close enough. There are several proteins and other mechanisms responsible for any given trait, and the function of a protein may be different in different species.

The comparison of human to chimp DNA, AFAIK, has little to do with genes – researchers compare the raw code, and most of it is the same – hell, I think most living things on Earth HAVE to share some of their genetic code just to get the biochemical background right. Vertebrates share even more, as they have so much anatomy in common; by the time you get to chimps, you can point out many more similarities than differences.

Single-pair comparisons involve hunting down the matching sequences until you’ve accounted for everything. The more sophisticated large-sequence comparison takes into account the duplication and shuffling of short DNA sequences – it discounts the apparent matching of sequences that obviously don’t code for anything similar.
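
One hedged way to see why the two kinds of comparison can give different numbers: if one sequence has picked up an extra base, a naive column-by-column count reports many mismatches, while a comparison that first lines up matching stretches reports a single change. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for real alignment software; the `variant` sequence is hypothetical, and this is not the method used in the article.

```python
from difflib import SequenceMatcher

human = "GGATTCAATGAGGCT"
variant = "GGATTCAAATGAGGCT"  # hypothetical: the same sequence with one extra A


def naive_mismatches(a, b):
    """Count differing columns, ignoring any length difference at the end."""
    return sum(x != y for x, y in zip(a, b))


def aligned_edits(a, b):
    """Count bases inserted, deleted, or replaced after lining up matching runs."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    edits = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits += max(i2 - i1, j2 - j1)
    return edits


print(naive_mismatches(human, variant))  # 6
print(aligned_edits(human, variant))     # 1
```

The single inserted base shifts everything after it, which is why the naive count is inflated.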

Thanks for replying!

Is the part that’s wrong that a gene is a group of base-pairs on the DNA molecule, and thus one DNA molecule contains many genes? Or is that part correct, but my saying this gene is for hair colour and that gene is for liver size is way off? If both are incorrect, what is a gene?

Are these “other mechanisms” governed by genetic information found elsewhere than the proteins they’re using? Or is it that assemblies of proteins, once created based on genetic information, are thereafter subject to environmental factors which govern the creation of a trait like eye-colour? (or likely complicated combinations of all of the above?)

How accurate, then, is my understanding as revealed by the over-simple illustration of the 15 pair comparisons?

This links to my first question in this post: if a gene is not a sequence that appears on a DNA molecule, what is it? Or if that’s what a gene is, are the sequences compared in these large-sequence comparisons groups of genes?
I guess a better question (besides “can you correct my misunderstandings”) is: What is a genome? What does it tell us? Is there one genome per species? I get the impression that once the genome is mapped we essentially have the blueprint to create an instance of that species - if only we had the construction technology available. If genes map only to proteins, is a genome then the blueprint for creating the necessary Lego blocks of a species - some assembly required[sup]*[/sup] - and now the search is on for those instructions found elsewhere?


[sup]*[/sup]batteries not included

No, you’re not entirely incorrect. A gene is a DNA sequence - a grouping of ladder rungs. One gene gets copied into one strand of RNA, which is then translated into one protein (or, more accurately, one polypeptide - but go ahead and think one protein). That’s sort of how a gene is defined. There’s a sequence at one end that says “start copying here” and a sequence at the other end that says “stop now”.

“Genome” is, like you said, just a term meaning the complete collection of all the genes on all the chromosomes. The human genome is all of the genetic information in a human.

The whole human/chimp comparison is kind of pointless in my opinion, and varies depending on how you measure it. The way it was originally done was by mixing human and chimp DNA together, melting the strands apart, then letting them recombine, so that you had some molecules where one strand was human and one was chimp, then melting them again. You can then measure what’s called the “melting temperature”, which is directly related to the number of mismatched pairs. Which is a long-winded way of saying it measures similarity at the level of base pairs. A-T vs C-G. More modern methods could look at differences only in coding regions - genes - or only at what actual genes are present, or whatever.

Still, like I say, I don’t know why people - other than evolutionary biologists - care. We also share something like 50% similarity with your average banana.

Hope that helps a little.

I should do a Columbo impression…“One more thing…”

The reason it’s not correct to say “this gene determines liver size” or “this gene determines nose shape” is because large scale traits like that will emerge from the interaction of many many proteins, each coded for by their own gene, and often involve interaction with the environment as well. When we say “this gene causes breast cancer” or “this gene determines hair color”, it’s because we’ve identified one specific part of some pathway that can vary with predictable consequences.

I’m a computer programmer and the idea of data models and information systems turns my crank. I’m intrigued by genome as complex data structure. More though, as a human, I wonder what it’s all about, where we came from, etc., and to the layman’s eyes there appears to be information or clues in what you pesky geneticists are up to. I wanna peek.

Finally, there are public debates on the role of genetic technologies that I feel helpless to contribute to, much less properly consider - questions like: Will genetic research lead to discrimination issues a la “Gattaca”? Do I own my genes and the information encoded therein? What about genetically engineered food, stem cell research, cloning, etc…

Thanks for the explanation of how the base-pair comparison is done. It corrects my image of some massive two column printout, Human on the left, Chimp on the right, and a diligent white-frocked scientist running a finger down the columns saying occasionally, “Found another mismatch! That’s 47 so far!” (computer was down that day)

Can we decode and read DNA like a book at that level? If so, can we draw a circle around each and every gene, knowing what the “start copying” and “stop now” patterns look like (or the SOH/EOT characters in computer-speak)? Are these start/stop identifiers the same across species? within species?

This surprises me - not because I can refute it (or would want to). But isn’t performing the comparison a difficult thing to do? Why do it if it’s pointless? Wouldn’t such analysis help refine our catalog and hierarchy of species? Or am I reading too much into it? Wouldn’t wider understanding of this help squelch the evolution-taught-in-schools debates? I, for one, am quite proud to be a distant relative of that banana (and bear a family resemblance).

One thing I’m having difficulty visualizing is: What is essential to my genetic makeup that makes me human, and what makes me just me, i.e. what static genetic structures define my species and what variables define my uniqueness within that species? I don’t know if there’s an easy-to-explain answer - or any answer at all.

That on your 46 chromosomes, you’ve got genes for bipedalism, and a big brain, and a spine, and lungs, and little body hair, and opposable thumbs on your hands but not your feet, and the facility to understand language, etc. And you don’t have genes for photosynthesis, or egg-laying, or the ability to breathe under water, or the ability to grow pinecones. The difference between humans and other forms of life, although incalculably vast, is essentially one of degree, not kind – all living creatures use the same structures and systems to code for the bodies that are produced; your DNA just orders up a different form than does an oak tree’s, or a three-toed sloth’s, or Allison Janney’s.

–Cliffy

A little more detail on the encoding: Base-pairs are arranged into “words” of three base-pairs, called codons. There are 20 different amino acids which make up proteins (the protein is a long chain of amino acids); each codon codes for one of those amino acids (some codons are for the same amino acid). There are only a few codons (and therefore, only a few amino acids) which can start a protein, and there are a few codons (three in the standard code) which don’t code for an amino acid at all, but just mean “stop”.
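
A quick bit of arithmetic shows why some codons have to share an amino acid: four letters taken three at a time give 4 × 4 × 4 = 64 possible codons, far more than the 20 amino acids plus a "stop" signal. A minimal Python check (the variable names are just for illustration):

```python
from itertools import product

# Every possible three-letter "word" over the four DNA letters.
codons = ["".join(letters) for letters in product("ACGT", repeat=3)]

amino_acid_count = 20  # figure from the post above

print(len(codons))                     # 64
print(len(codons) > amino_acid_count)  # True, so the code must be redundant
```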

The ribosome is the cell structure which makes proteins: A ribosome will scan along a piece of nucleic acid (actually RNA, but the RNA carries the same information as does the DNA, being copied from it) until it reaches one of the start codons. It’ll then go fetch that amino acid from somewhere in the cell, and start a chain. It then moves on to the next three-letter group, and fetches the amino acid that one codes for and adds it to the chain. It keeps on adding more amino acids to the chain until it gets to a stop codon, at which point it lets go of the protein and releases it into the cell. The complex interactions of the various amino acids cause the proteins to fold into interesting shapes; it’s mostly the shape of the resulting protein which determines its function. Some proteins (called enzymes), in turn, act as catalysts for various chemical reactions which produce the non-protein chemicals the organism uses. Note that a great many of the proteins produced by the DNA function merely to support the processes of DNA replication and translation into protein, and since these functions are common to all living things, the proteins for them are mostly constant as well, and therefore also much of the DNA which codes for those proteins.
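
The scan-start-extend-stop loop described above maps neatly onto a short program. The sketch below is a simplification under stated assumptions: it works on DNA letters (T where RNA would have U), treats ATG as the only start codon, and uses the standard genetic code written in the compact string form published in NCBI's genetic code tables (one-letter amino acid codes, `*` for stop). The function name `translate` and the input sequence are made up for this example.

```python
from itertools import product

# Standard genetic code: codons enumerated in T, C, A, G order,
# one-letter amino acid codes, '*' marking a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Scan to the first ATG, then read codons until a stop codon or the end."""
    start = sequence.find("ATG")
    if start == -1:
        return ""  # no start codon, so nothing is made
    chain = []
    for i in range(start, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: release the chain
        chain.append(amino_acid)
    return "".join(chain)


print(translate("CCATGGCTTAAGG"))  # MA  (ATG = M, GCT = A, TAA = stop)
```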

The stretch of DNA which codes for a protein is a gene. For some genes, the single protein produced has a significant and noticeable effect. For instance, people with type A or AB blood have a gene which produces the A antigen, a protein found in their blood cells. People with other blood types don’t have that gene, and therefore don’t have that particular protein in their blood. But for many or most genes, the proteins produced work together in many complicated ways to lead to observed characteristics. There isn’t, for instance, a single gene which says how many legs we have, that being a result of a great many genes (which also interact in other ways to produce other traits).

Please note that this is just a general overview, and I’ve glossed over many complications (both to avoid confusion, and because I’m not familiar with all of them). If any other Dopers add details, take their word for it.

So aside from those cases where the protein coded for is the trait in question (e.g. type A or AB blood), traits are only indirectly encoded in the genes. If a geneticist sees a genome for a species they’ve not yet heard of, they couldn’t point to a gene or a set of genes and say, “Ah, this is a bipedal creature, and boy will it have a big liver!” While such features are there, it is only by understanding what the proteins do with themselves that these features can be read. This makes sense.


Chronos, your explanation helps me visualize “what makes me human, and what makes me me.” Let’s say a given section of the human genome is responsible for producing a set of proteins, enzymes, etc. that will start a chain reaction eventually resulting in the production of two legs. If we could trace the process of leg production back to those genes responsible, I would expect to find that configuration of genes in every human.

What may differ from one human to the next is that maybe the order of the genes is slightly optimized, so that the process gets a head start (or some other efficiency gain) and thus one person grows longer legs. Or maybe the gene drops, adds, or shuffles a few base-pairs, producing a protein that folds slightly differently but still performs the same function, only a bit better or a bit worse.

Now, the way I describe the beginnings of leg making may be way off, but, generalized, is this close to how all humans have the same set of traits while each human is unique?

That’s a good question; this is a line of enquiry that geneticists have been pursuing since we started being able to sequence DNA.

If we have the sequence of a length of DNA, we can ‘decode’ it in the sense that we can predict what the sequence of the corresponding protein would be. We can do this because we know (through a series of clever experiments) what the mapping of DNA --> protein is - including the start/stop identifiers. The mapping, including the start/stop identifiers, is the same within species, and the same across large groups of species. However, there are some small differences between groups of organisms that are very distantly related (e.g. there are differences in the code between humans and some bacteria; also, there are differences between the ‘regular’ human code and the code used by genes found in bits of cells called mitochondria - which tells us something very interesting about evolutionary history). The Wikipedia page is quite good.
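
To make the mitochondrial point concrete, here is a small Python sketch that lines up the standard code against the vertebrate mitochondrial code and lists the codons that differ. Both 64-letter strings are transcribed from NCBI's published genetic code tables (tables 1 and 2), in T, C, A, G codon order with `*` meaning stop; the variable names are only for illustration.

```python
from itertools import product

BASES = "TCAG"
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERTEBRATE_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

codons = ["".join(letters) for letters in product(BASES, repeat=3)]

# Codon -> (meaning in the standard code, meaning in vertebrate mitochondria)
differences = {
    codon: (standard, mito)
    for codon, standard, mito in zip(codons, STANDARD, VERTEBRATE_MITO)
    if standard != mito
}

for codon, (standard, mito) in differences.items():
    print(codon, "standard:", standard, "mitochondrial:", mito)
```

Only four of the 64 codons differ: TGA (stop in the standard code, tryptophan in mitochondria), ATA (isoleucine vs. methionine), and AGA and AGG (arginine vs. stop).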

The second part of your question is the most interesting - can we draw a circle around each and every gene…?

It depends very much on the organism you’re looking at. In some types of organism (e.g. bacteria) the genes are pretty much arranged one after the other, so it’s fairly easy to draw the circles. In others (e.g. humans), the genes are separated from one another by large regions of DNA that aren’t genes. In this case, we have to pick out the small regions that are genes from the much larger regions that aren’t. A lot of work has gone into finding methods to do this (which have borrowed heavily from techniques in computer science / string processing - you might find that kind of thing interesting) and it’s still a difficult problem.
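
As a taste of the string-processing flavour, here is a deliberately naive Python sketch of "drawing circles": a regular expression that looks for a start codon, then whole codons, then the first stop codon in the same reading frame. It assumes ATG as the only start codon and the three standard stop codons, scans one strand only, and ignores everything that makes the problem hard in organisms like humans. The function name `find_open_reading_frames` and the input sequence are made up; real gene finders are far more sophisticated.

```python
import re

# ATG, then as few whole codons as possible, then the first in-frame stop codon.
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def find_open_reading_frames(dna):
    """Return (start, end) pairs, 0-based and end-exclusive, of non-overlapping candidates."""
    return [match.span() for match in ORF_PATTERN.finditer(dna)]


toy_dna = "TTATGGCTTAACCATGAAATGACC"  # made-up sequence for illustration

for start, end in find_open_reading_frames(toy_dna):
    print(start, end, toy_dna[start:end])
# 2 11 ATGGCTTAA
# 13 22 ATGAAATGA
```

On the bacteria-like layout described above, something in this spirit gets you surprisingly far; on a genome where genes are small islands in long stretches of non-gene DNA, it would mostly circle noise.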