Will DNA let us build the ultimate family tree?

Here’s a question for those who have some knowledge of DNA testing…

I’ve seen so much lately about tracing ancestry back several generations through DNA and related things like mitochondria. This seems to bring up an obvious question, but I haven’t seen anyone address it yet.

Basically, given a blood sample from enough people living today, and a big enough computer, could we build the ultimate family tree? Maybe back into prehistory?

The tree I’m thinking of would have nodes for individuals, who we would know for sure were there even if we didn’t have names to put on them. Wouldn’t *that* produce a few surprises…

I don’t think we could get resolution that fine. Any given person’s mitochondrial DNA is likely to be identical to that of the person’s mother. Occasionally there’s a mutation, and given a statistically significant number of mutations, one can estimate the number of generations separating two individuals, but with few or zero mutations (as for close relatives), there is almost nothing you can say.
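To put rough numbers on that, here is a minimal sketch, assuming mutations arrive independently at some fixed chance per mother-to-child transmission. The rate `mu = 0.01` is a made-up illustrative value, not a measured one, and both function names are hypothetical.

```python
import math

def estimate_transmissions(differences, mu):
    """Point estimate of mother-to-child steps separating two people,
    given the number of observed mtDNA differences and an ASSUMED
    per-transmission mutation chance mu."""
    return differences / mu

def max_transmissions_with_no_differences(mu, confidence=0.95):
    """With zero observed differences, the largest separation that is
    still plausible: the point where the chance of seeing no mutations
    (Poisson approximation, exp(-mu * g)) drops to 1 - confidence."""
    return math.log(1 / (1 - confidence)) / mu

mu = 0.01  # illustrative assumption only
print(estimate_transmissions(5, mu))              # several differences: usable estimate
print(max_transmissions_with_no_differences(mu))  # zero differences: only a loose upper bound
```

Under that assumed rate, identical sequences are consistent with anything from a mother and child up to a few hundred transmissions apart, which is the "almost nothing you can say" case.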

Read Richard Dawkins’s essay in this book.

Among other things, he predicts that by 2050, field biologists will have a portable DNA sequencer to determine what kind of animal they have collected.

Finally, the answer is upon us to an age-old question:

Who’s your daddy?

I remember seeing a program on T.V. (in the U.K.) called “Meet the Ancestors” where archaeologists unearthed a skeleton they carbon-dated to around 2,000 years old IIRC. They then took a DNA sample from the body and “matched” it to one of the residents of the small village where the body was found, claiming that they were certain the present-day resident was a descendant of the deceased. However, I’m not sure how much of this was scientific certainty and how much was fudged to placate the villagers after digging up most of their property.

I heard something about a project to trace ancestry to determine how people migrated through the world.

I remember that program. Was it NOVA? I remember seeing it on PBS here in the States. I was under the impression that they were pretty certain the man in the town (actually the teacher of all the children who were tested, AFAIR) was the descendant.

That would be the Cheddar Man.

Just to note: mitochondrial DNA sits outside the nucleus and is transmitted by the mother via the ovum, so it only travels down the maternal line. Since it isn’t subject to meiosis it doesn’t vary except by mutation over time. While this allows you to establish that two people are on the same maternal line (or perhaps branch is the better term), it isn’t really possible, absent any other evidence, to establish whether the relationship is mother-child, sister-sibling, aunt-niece or -nephew, et cetera, or even more distant relationships. Ditto for the Y chromosome, except that it is found exclusively in males (except in rare cases) and so tracks the paternal line.

Polymerase chain reaction (PCR) and other methods of typing genetic markers let you establish the probability of a relation between two people, or a person and a sample (to a point of virtual certainty, the OJ Simpson trial notwithstanding), but they aren’t particularly useful for large, evolutionary-scale variations, due to the specificity of the markers to the modern human genome and the sparsity of the sampling. Actually sequencing entire genomes and comparing them as a data set is a task daunting enough to make a professional cryptologist cringe. With greater computing power and adaptive methods of sequencing and matching it may become much easier, but we’re still decades down the road from being able to do it routinely enough to compare a large body of genomes.

In addition, mutations don’t happen in a linear fashion, i.e. a flipped bit here and there at a constant rate along the chromosomal chains; some places are more prone to mutation than others, and much of mutation involves deletions, inversions, translocations, and insertions by viruses. Because of this, you can’t really look at the genomes of a single generation and extrapolate back what happened when; you really need to sample over successive generations and interpolate, filling in the holes to determine when changes occurred. Even comparing, say, the genomes of all descendants of a set of grandparents is only going to give you an approximate idea of what Grandma’s and Grampa’s genomes might have looked like, and the looping and intersecting branches that inevitably happen over more than a few generations (never mind the Hapsburgs) make the problem even more complex as an explicit solution.

You can, however, make some good guesses about the relationships between various members of the current generations. You might get some “nasty” results out of it, though, at least if you have any beliefs toward eugenics and “racial purity”. A single, non-deleterious mutation travels rather quickly through the gene pool even for slow-propagating human apes, and with the exception of geographically isolated tribes in remote areas it is doubtful anyone can claim a genetic exclusivity to any tribe, clan, or pool back more than a few generations.

Nor would you want to. Even lacking recessive genes for specific genetic disorders, a homogeneous population would still be easy prey for an opportunistic virus, bacterium, or other parasite, as the Native Americans discovered. Hybrid vigor is a good thing; purebreeds win dog shows but mutts tend to win dogfights.

Stranger

This is an interesting question. Let’s suppose we had DNA from everyone alive today. We could certainly detect siblings, cousins, and second cousins. At some point, however, you start getting a high probability of remote inbreeding: i.e., your great-great…grandma on one side is the same as on the other side. This would cause some problems. I suspect it would go something like: we could get something like 99.999% probability for siblings, then decreasing probability as you get further away. At some point, the probability would be too low to be considered accurate, and that probably wouldn’t take you back more than a few centuries.
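The "same grandma on both sides" effect is forced by simple counting: ancestor slots double every generation, so sooner or later there are more slots than people to fill them. A small sketch of that arithmetic, using today’s population of roughly 6 billion as a deliberately generous stand-in (past populations were far smaller, so real overlap starts much earlier); the function name is hypothetical.

```python
def generations_until_slots_exceed(population):
    """Smallest number of generations back at which the 2**g ancestor
    slots in one person's pedigree outnumber the given population."""
    generations = 0
    while 2 ** generations <= population:
        generations += 1
    return generations

# Generous assumption: as many people available as are alive today.
print(generations_until_slots_exceed(6_000_000_000))
```

Beyond that depth, at least some ancestors must occupy more than one slot, which is exactly the looping that makes a clean tree impossible.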

I’m not sure if anyone has actually tackled this problem. There was recently an effort announced to collect DNA from 100,000 people worldwide to get better resolution on human migration, but this is about populations, not individuals.

It could be done, but not in a single pass.

Certain genes would be too variable for use on the whole population… there’d be too much noise. You wouldn’t be able to tell the difference between 3rd cousins and thousandth cousins. Other genes, OTOH, don’t vary enough for use on the whole population… there’s no resolution, entire towns are identical.

You’d have to run several trials using fairly conservative genes to break the population into big chunks. And then a less conserved gene to break those chunks into smaller chunks, and so on and so forth.

Then you overlay everything like it was the world’s largest Venn diagram and figure out the highest probability family tree.

Then you’d run a bunch of randomly selected genes through it and see if the inheritance patterns made sense in light of your constructed relational tree. Note problem areas, and run genes of appropriate variability to shed some light on the subject… modifying the tree as appropriate.

We’ve had a pretty absurdly high amount of gene flow in the past couple hundred years, though. I doubt a tree would be the resulting shape. The relational ring algorithm used for mapping ad hoc social organizations (e.g. crime syndicates) might be interesting.

The übergeek in me just wanted to point out that you can’t create a Venn diagram with more than four “simply” (circular or elliptical) bounded regions, nor does a graphical approach really lend itself to solution by computer, a la numerical methods. Better that you should formulate your data as some kind of monster tensor and feed it to something like Earth Simulator or BlueGene/L, provided you can afford the several months of computing time. Your tensor is going to use enough dimensions to make a gaggle of superstring theorists throw up their hands in despair, though.

Stranger

Long rambling post that will KILL this thread…

Just got a PhD in genetics last week so let’s give this thing a test spin. It certainly is easier with Y-haplotypes and mtDNA, and has been done for these on the population level (since these change slowly). In fact, IIRC Cavalli-Sforza used these to show differentials in female versus male population migrations in the South Pacific and the Iberian Peninsula/British Isles. Very interesting stuff.

Technology on the biology side will not be limiting. There are techniques that will be available within a few years to perform megabase-level sequencing without much trouble (and by that I mean that you won’t need a Genome core facility if you want to sequence an organism). We are already gearing up to do allele mapping from mutagenesis screens by sequencing entire BAC clones (around 50 kb per BAC, sequenced to 10x coverage, is half a megabase).

I suspect that it may be one of these “infinitely long” computer problems to solve, though. It amounts to finding a best-fit tree for let’s say at least 10,000 polymorphisms/haplotype markers per individual, first by looking for nearest neighbors and then arranging the nodes. Given the messiness of human breeding (multiple wives, questionable paternity, etc.), this would need to be incorporated into a very flexible tree building algorithm. Millions, if not billions, of trees will be generated, and math above my head will be used to determine the most probable answers.
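As a toy illustration of just the first step (looking for nearest neighbors), here is a minimal sketch. It assumes each person is reduced to the set of rare markers they carry and that "nearest" simply means "shares the most markers"; the names, the marker IDs, and the four-person data set are all made up, and a real analysis would use a proper statistical model rather than raw counts.

```python
def shared_markers(a, b):
    """Number of rare markers two individuals have in common."""
    return len(a & b)

def nearest_neighbors(people):
    """Map each individual to the other individual sharing the most
    markers with them. Brute force over all pairs, O(n^2), so a toy
    only; ties go to whoever comes first in the dict."""
    result = {}
    for name, markers in people.items():
        result[name] = max(
            (other for other in people if other != name),
            key=lambda other: shared_markers(markers, people[other]),
        )
    return result

# Hypothetical marker sets for four individuals.
people = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 9},
    "C": {7, 8, 9, 10},
    "D": {7, 8, 11, 12},
}
neighbors = nearest_neighbors(people)
print(neighbors)
```

Even this step compares every pair, so at 6 billion people it is already on the order of 10^19 comparisons before any tree arranging begins.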

In the end, all of the randomness introduced by human gene flow, coupled with complications like meiotic recombination, variable haplotype size, drift, and selection, would make it very, very difficult to build anything resembling an accurate family tree. But it may be possible to generate interesting trees, and identify relatives on a limited scale.

Thought experiment. Forgive me if it makes little sense. Let’s say that you sequence 10,000 polymorphisms for every person, and that a reasonably rare polymorphism happens at a rate of 1:100. That means you have 100 of these; 50 of them from each parent, and 50 of yours shared with each sibling and progeny (without regard to their inheritance from your father or mother – meiotic recombination). First cousins, sharing 1/8 of their genome, would have around 12-13 in common, on average.

The probability of another unrelated individual matching, by chance, at the first cousin level, would therefore be (10^-2)^12 = 10^-24 to (10^-2)^13 = 10^-26. Those are odds of one in 10^24 to one in 10^26, which is of course far more than all the humans that ever lived. Correct me if I’m wrong, I’m a biologist here.

So, if you sequenced all 6 billion people on the planet, chances are that you could statistically calculate significance to about the second cousin level – they share 1/16 of their genomes, so at a 1:100 polymorphism rate (and 100 total polymorphisms), they would share around 6 polymorphisms. And so P = (10^-2)^6 = 10^-12, and odds of one in 10^12 are still more than a hundredfold beyond your total number of individuals. I’d do a chi-square but it’s nearly 2 AM. But at the third cousin level, this goes to (10^-2)^3 or 10^-6, so you would not have nearly enough resolving power. Sequence more polymorphisms, if you can find them, or just give up.
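For anyone who wants to poke at these numbers, here is a small sketch of the arithmetic exactly as posted. It keeps the post’s simplifications: sharing halves at each cousin step (1/8, 1/16, 1/32), every marker matches by chance independently at 1:100, and a relationship counts as resolvable when fewer than one chance match is expected across 6 billion people. The function names are hypothetical. Note that the usual pedigree expectation is 1/32 for second cousins and 1/128 for third cousins, so plugging those in gives a less favorable answer.

```python
def expected_shared(fraction, rare_markers=100):
    """Rare markers two relatives are expected to share (rounded down)."""
    return int(rare_markers * fraction)

def chance_match_probability(shared, rate=0.01):
    """Chance an unrelated person matches all `shared` markers,
    treating each marker as independent (a simplification)."""
    return rate ** shared

def expected_chance_matches(shared, population=6_000_000_000, rate=0.01):
    """Expected number of unrelated people matching purely by chance."""
    return chance_match_probability(shared, rate) * population

# Sharing fractions as used in the post, not the standard pedigree values.
for label, fraction in [("first cousins", 1 / 8),
                        ("second cousins", 1 / 16),
                        ("third cousins", 1 / 32)]:
    shared = expected_shared(fraction)
    false_hits = expected_chance_matches(shared)
    verdict = "resolvable" if false_hits < 1 else "swamped by chance matches"
    print(f"{label}: ~{shared} shared, {false_hits:.3g} expected chance matches, {verdict}")
```

A stricter version would count chance matches over all pairs of people rather than against one person, which pushes the cutoff closer still.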

I think my polymorphism numbers (1:100 rate and 10,000 total) are reasonable, perhaps a little too optimistic. Given these, you would be able to link together second cousin nodes but anything beyond would be worthless. Going farther means identifying and sequencing more polymorphisms, and at some point you just run out of appropriate ones in the genome. For every extra cousin step out, you will need to double the number of polymorphisms you look at: given my numbers, 20,000 for third cousins, 40,000 for fourth cousins, 80,000 for fifth cousins. I think you could just run out before you get there, thus inherently limiting your analysis.
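The doubling rule is easy to tabulate. A minimal sketch, assuming the post’s baseline of 10,000 polymorphisms being enough for second cousins and one doubling per extra cousin step; the function name and parameters are hypothetical.

```python
def polymorphisms_needed(cousin_degree, base_total=10_000, base_degree=2):
    """Polymorphisms to sequence per person so that cousins of the given
    degree still share as many rare markers as second cousins do at the
    baseline, under the post's assumption that sharing halves per step."""
    if cousin_degree < base_degree:
        raise ValueError("baseline already covers closer relatives")
    return base_total * 2 ** (cousin_degree - base_degree)

for degree in range(2, 11):
    print(degree, polymorphisms_needed(degree))
```

By tenth cousins the requirement is already in the millions per person, which is where the "run out of appropriate ones" limit would bite.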

That’s my slightly more than 2 cents.

A thought just occurred to me: Do we have any data besides genomes to work with? Given the complete genomes for two persons, and no other data, I think one could determine that the relationship between them is parent-child. But I don’t think, given just the genomes, that one could determine which is which. Of course, this particular ambiguity could be resolved easily just by the addition of age information. So, are we allowed to use that? Or do we have to use the set of all other interconnections to determine which of a pair is parent and which is child?

And congratulations, edwino!

I have an image of you putting the degree on the table and making Ouija-style hand movements over it to generate that post. I can’t wait till I get mine and the magical powers that come with it.

edwino: Yes, congrats on the PhD. Would we be able to differentiate between half-brothers and cousins, since they both share the same number of grandparents? I’m thinking that these types of relationships would further complicate the family tree construction. Assume that you don’t have either of the parents’ DNA to compare, just that of the two half-brothers or cousins.

Chronos and John Mace: Thanks for the congratulations. Now back to med school for me…

We would definitely need some pedigree/family tree/age information as well. There are a number of probable complications – half-siblings share 1/4 of the genome, and by my estimation first cousins share 1/8 (you share half with your parent, who shares half with his sibling, who shares half with his child…). But for instance, you wouldn’t be able to tell an uncle from a half-sibling without grandparent information, which we are not likely to get all that often. Sampling just one or two generations at a sweep will allow us to construct simple nodes of parent/child groupings, or perhaps a cluster including everyone out to second cousins, which should stand out statistically from background (using my out-of-my-ass polymorphism data). Adding age information may allow us to get to a semblance of a family tree, but given the messiness in human breeding, there are going to be a lot of ambiguous cases, thus leading to a clustering that doesn’t resemble a family tree.

You and your siblings would be on one branch; these can probably be distinguished from your parents (if we have their data), who would be on the next branch over. Uncles, aunts, grandparents, and half-siblings would be a branch over from that. Then first cousins and great aunts and uncles, etc. etc. Age data may add some, but there is no way to conclusively distinguish a half-sibling from an aunt. Assuming completeness of a data set, I suppose these can be arranged into family trees, but again, I have the suspicion that the deconvolution of the billions of trees formed (each individual will have their own tree) would be one of those computer problems that is solvable but only in a time frame longer than the life of the universe…
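The branch idea can be written down as a lookup of expected sharing, which makes the ambiguous cases obvious. A minimal sketch using the textbook expected fractions (averages only; actual sharing varies from pair to pair because of recombination). The table and function name are illustrative, not anyone’s actual pipeline.

```python
from collections import defaultdict
from fractions import Fraction

# Expected fraction of the genome shared, on average, by relationship.
EXPECTED_SHARING = {
    "parent-child": Fraction(1, 2),
    "full siblings": Fraction(1, 2),
    "half-siblings": Fraction(1, 4),
    "uncle/aunt and niece/nephew": Fraction(1, 4),
    "grandparent-grandchild": Fraction(1, 4),
    "first cousins": Fraction(1, 8),
}

def ambiguous_groups(sharing):
    """Group relationships by expected sharing and keep only the groups
    with more than one member, i.e. the ones a single sharing figure
    cannot tell apart."""
    groups = defaultdict(list)
    for relationship, fraction in sharing.items():
        groups[fraction].append(relationship)
    return {f: rels for f, rels in groups.items() if len(rels) > 1}

for fraction, relationships in ambiguous_groups(EXPECTED_SHARING).items():
    print(fraction, relationships)
```

So half-brothers (1/4) and first cousins (1/8) separate on sharing alone, while half-siblings, uncles, aunts, and grandparents all land in the same 1/4 bin and need age or pedigree data to pull apart.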