Long rambling post that will KILL this thread…
Just got a PhD in genetics last week, so let's give this thing a test spin. It's certainly easier with Y-haplotypes and mtDNA, and it has been done for those at the population level (since they change slowly). In fact, IIRC Cavalli-Sforza used them to show differentials in female versus male population migrations in the South Pacific and the Iberian Peninsula/British Isles. Very interesting stuff.
Technology on the biology side will not be limiting. There are techniques that will be available within a few years to perform megabase-level sequencing without much trouble (by which I mean you won't need a genome core facility to sequence an organism). We are already gearing up to do allele mapping from mutagenesis screens by sequencing entire BAC clones (around 50 kb per BAC; sequenced to 10x coverage, that's half a megabase of raw sequence per clone).
I suspect it may be one of those "infinitely long" computer problems to solve, though. It amounts to finding a best-fit tree over, say, at least 10,000 polymorphisms/haplotype markers per individual, first by looking for nearest neighbors and then by arranging the nodes. Given the messiness of human breeding (multiple wives, questionable paternity, etc.), all of this would need to be incorporated into a very flexible tree-building algorithm. Millions, if not billions, of trees would be generated, and math above my head would be used to determine the most probable answers.
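If you want the flavor of that nearest-neighbor step, here's a toy sketch in Python (made-up marker sets, purely illustrative; a real algorithm would be vastly hairier):

```python
from itertools import combinations

# Made-up input: each individual is a set of rare-marker IDs (purely illustrative).
genotypes = {
    "A": {1, 2, 3, 4, 5, 6, 7, 8},
    "B": {1, 2, 3, 4, 9, 10, 11, 12},      # shares 4 markers with A
    "C": {1, 2, 13, 14, 15, 16, 17, 18},   # shares 2 markers with A
    "D": {20, 21, 22, 23, 24, 25, 26, 27}, # shares none
}

def shared(a, b):
    """Count of rare markers two individuals have in common."""
    return len(genotypes[a] & genotypes[b])

# Nearest-neighbor pass: rank every pair by shared-marker count.
# A tree builder would then greedily join the closest pairs into nodes.
for a, b in sorted(combinations(genotypes, 2), key=lambda p: -shared(*p)):
    print(f"{a}-{b}: {shared(a, b)} shared markers")
```

The hard part is everything after this: turning pairwise scores into millions of candidate trees and scoring them.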
In the end, all of the randomness introduced by human gene flow, coupled with complications like meiotic recombination, variable haplotype size, drift, and selection, would make it very, very difficult to build anything resembling an accurate family tree. But it might still generate interesting trees, and identify relatives on a limited scale.
Thought experiment. Forgive me if it makes little sense. Let's say you sequence 10,000 polymorphisms for every person, and that a reasonably rare polymorphism occurs at a rate of 1:100. That means each person carries about 100 of these: 50 from each parent, and about 50 shared with each sibling and each child (regardless of whether they came through your father or mother; meiotic recombination shuffles them). First cousins, sharing 1/8 of their genome, would share around 12-13 of them on average.
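Here's that bookkeeping in code, so you can swap in your own numbers (the assumptions are mine from above):

```python
N_MARKERS = 10_000   # polymorphisms sequenced per person
RARE_RATE = 1 / 100  # frequency of a reasonably rare polymorphism

rare_per_person = N_MARKERS * RARE_RATE  # ~100 rare polymorphisms each

# Fraction of those ~100 rare markers shared, per my assumptions above.
sharing = {
    "parent/child":  1 / 2,
    "sibling":       1 / 2,
    "first cousin":  1 / 8,
    "second cousin": 1 / 16,
}
for relation, fraction in sharing.items():
    print(f"{relation}: ~{rare_per_person * fraction:.0f} shared rare markers")
```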
The probability of an unrelated individual matching, by chance, at the first-cousin level would therefore be between (10^-2)^12 = 10^-24 and (10^-2)^13 = 10^-26. Odds of 1 in 10^24 are of course far longer than the number of humans who have ever lived. Correct me if I'm wrong; I'm a biologist here.
So, if you sequenced all 6 billion people on the planet, chances are you could establish statistical significance down to about the second-cousin level. Second cousins share 1/16 of their genomes, so at a 1:100 polymorphism rate (and 100 rare polymorphisms each) they would share around 6 of them, giving P = (10^-2)^6 = 10^-12; 1 in 10^12 is still a couple of hundred times longer odds than your total number of individuals. I'd do a chi-square but it's nearly 2 AM. At the third-cousin level, though, this drops to (10^-2)^3 = 10^-6, and with six billion people each person would have thousands of spurious matches by chance alone; you simply don't have the resolving power. Sequence more polymorphisms, if you can find them, or just give up.
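Spelled out with actual numbers (again, all my assumptions; the "false matches" line is just probability times population):

```python
POPULATION = 6e9   # everyone on the planet
RARE_FREQ = 1e-2   # frequency of each rare polymorphism

# k = number of rare markers shared at each cousin level (my estimates above)
for relation, k in [("first cousin", 12), ("second cousin", 6), ("third cousin", 3)]:
    p_chance = RARE_FREQ ** k              # P(an unrelated person matches at all k)
    false_matches = POPULATION * p_chance  # expected chance matches, whole planet
    print(f"{relation}: P ~ {p_chance:.0e}, expected false matches ~ {false_matches:.0e}")
```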
I think my polymorphism numbers (a 1:100 rate and 10,000 total) are reasonable, perhaps a little too optimistic. Given them, you would be able to link together second-cousin nodes, but anything beyond that would be worthless. Going farther means identifying and sequencing more polymorphisms, and at some point you simply run out of appropriate ones in the genome. For every extra cousin step out, you need to double the number of polymorphisms you look at: given my numbers, 20,000 for third cousins, 40,000 for fourth, 80,000 for fifth. I think you'd run out before you got there, which inherently limits the analysis.
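The doubling, as a loop (my numbers; where exactly the genome runs out of usable rare polymorphisms is the open question):

```python
markers = 10_000  # enough to resolve second cousins, per the estimate above
for degree in ("third", "fourth", "fifth"):
    markers *= 2  # each extra cousin step halves sharing, doubling what you need
    print(f"{degree} cousins: ~{markers:,} polymorphisms")
```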
That’s my slightly more than 2 cents.