Is DNA "searchable"?

I asked a question yesterday about matching adoptees to parents via DNA searches, and given the paucity of responses, I realized I had phrased it poorly and confusingly.

So let me restate the question. Given current technology, is there enough relevant information in the type of DNA record typically kept on file to allow searching, cross-matching, and matching of children to parents, given a searchable database of millions to hundreds of millions of individual genome indexes?

In other words, is DNA parseable and “searchable” in the context of being able to match one child’s DNA to a specific parent’s DNA out of millions of possible matches?

No. AFAIK, there isn’t even a national database of felons’ DNA, let alone a database of everybody ever tested for any reason. This is a good thing, given that I don’t want my DNA searchable by any random person if I decide to get tested for something, like a breast cancer gene.

No. The human genome project depended on the DNA of only a couple of people. When we refer to the genome being sequenced, for all intents and purposes, we can pretty much say that the genome of one person has been sequenced.

Much of the genome stays the same from person to person. So we don’t need to sequence every person’s genome – that would be a huge expenditure. Instead, we aim to categorize polymorphisms, the parts of the genome that differ from person to person. There are many polymorphism projects going on right now, which attempt to track down the most common polymorphisms by looking for them in many different people.

DNA fingerprinting, used to identify people, usually just analyzes a suite of these polymorphisms. Pick a few dozen different polymorphisms, and one can be sure to within one in billions that two DNA samples come from the same person. Or, looking at relatives’ DNA, one can establish relatedness. This is the basis of paternity tests, crime scene testing, and the like.
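As a rough sketch of that arithmetic (the per-marker match probability here is an illustrative assumption, not a measured value), a few lines of Python show how quickly the odds of a coincidental match fall off:

```python
# Back-of-envelope for the claim above: probability that two unrelated
# people match by coincidence across a panel of independent polymorphic
# markers. The per-marker figure (0.3) is an illustrative assumption.

per_marker_match = 0.3   # assumed chance two random people match at one marker

for n_markers in (10, 20, 30, 40):
    p_coincidental = per_marker_match ** n_markers
    print(f"{n_markers} markers: about 1 in {1 / p_coincidental:,.0f}")

# A few dozen markers pushes the odds of a coincidental full match far
# below one in a billion.
```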

I am sure that the polymorphism projects are being very careful to discard personal information associated with each sample. This is due to a lot of factors, genetic privacy concerns being one of the biggest. As GilaB noted, this is unlikely to change anytime soon, except maybe for felons, in order to create an FBI genetic database similar to the fingerprint database they now use.

Now in terms of sheer searchability, there are hundreds of different ways to search the gigabases of publicly available sequence. These range from genome browsers (my favorite is at http://www.ensembl.org/) to DNA/DNA or protein/DNA search engines like BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) to organism-based sites (http://www.flybase.org/).

So… if there were a comprehensive public record of these individual DNA polymorphisms at some point in the future, it would be possible to take the data file of a child’s DNA polymorphism(s) and cross-match it against the data sets of millions of possible parents until you found a parental match for a mother or father.

Thanks for the info!

As far as I know, DNA ‘fingerprinting’ doesn’t involve any actual sequencing – it’s a method of searching for patterns in a person’s DNA called VNTRs (Variable Number Tandem Repeats). DNA fingerprinting involves finding and sorting the VNTRs a given person has according to size. Three important things about VNTRs:

  • They vary in size, so they’re easily sortable by gel electrophoresis, a process that sorts fragments of DNA (or RNA or protein) according to size and weight.
  • They vary between people enough so that the VNTRs a given person has can be used to identify them.
  • They’re inherited genetically, so the VNTRs a person has can be used to determine the relationship between them and another person, such as in paternity cases.

Since the VNTRs are essentially ‘garbage’ DNA (at least in our current understanding), there’s no point in sequencing them. Each test is performed in isolation – for example, a child’s DNA might be compared to a potential parent’s, or DNA found at a crime scene might be compared to the DNA of a potential suspect. The test can include or exclude a relationship between the samples, but it doesn’t provide any useful, searchable data.
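A minimal sketch of that isolated pairwise test, assuming (hypothetically) that each profile is stored as two allele values per locus and that a true biological parent must share at least one allele with the child at every locus; the locus names and numbers are made up:

```python
# Sketch of the isolated pairwise test described above: a child's VNTR
# profile is compared to one candidate parent's profile, locus by locus.
# Profiles here are hypothetical {locus: (allele1, allele2)} values.

child = {"D1": (14, 17), "D2": (9, 12), "D3": (22, 22), "D4": (8, 11)}
candidate = {"D1": (17, 19), "D2": (12, 12), "D3": (22, 25), "D4": (6, 9)}

def exclusionary_loci(child_profile, parent_profile):
    """Count loci where the candidate shares no allele with the child.
    A true biological parent should share at least one allele per locus."""
    return sum(
        1
        for locus, child_alleles in child_profile.items()
        if not set(child_alleles) & set(parent_profile[locus])
    )

mismatches = exclusionary_loci(child, candidate)
print("excluded" if mismatches > 0 else "cannot be excluded",
      f"({mismatches} exclusionary loci)")
```

The point of the quoted post stands: the test answers a yes/no question about two specific samples; it does not by itself give you a record you can search against.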

Of course, you could always record an image of the ‘fingerprint’ (the Southern blot produced by separating the VNTR fragments), and somehow enter the data into a searchable database. However, the distance traveled by each band on the fingerprint, the distance between bands, the size of the bands, and so on, are all characteristic of the conditions used to prepare the Southern blot. Recording the data in a searchable form could only work if you ensured that the conditions for the Southern blot (voltage, time, salt concentration, gel type, etc.) were identical for every DNA fingerprint in the database. If you really wanted to make a database (which raises significant ethical questions), you would have to sequence the VNTRs to determine their size (the ‘variable number’) and their sequence, then make that searchable. So it’s not something to worry about quite yet.

The existing searches which edwino mentioned – there are others, but BLAST is enough for anyone interested in genomics non-professionally – are primarily databases of exons, the DNA which actually encodes proteins, not ‘garbage’ DNA like VNTRs. In any species, there are at most a few different possible sequences for each gene; some genes have the same sequence in all members of the species, except those with a genetic disease.

So, for the most part, the existing genomic databases would be useful mostly to determine what species a sample comes from, not to identify which individual. A search could also be used to identify genetic diseases, even considering the information we have now. But a searchable genetic database similar to the fingerprint databases currently in existence may not arrive for some time.

So even if we didn’t know the potential parent, the child’s DNA could be compared against millions of possibilities until a match is found. Would a match made this way be definitive (i.e., “this is one of the parents”), or only “this could be the parent”?

VNTRs, STRs (short tandem repeats), SNPs (single nucleotide polymorphisms), and the like can be categorized by sequencing. The SNP project is aiming to do this (we should invoke Tars Tarkas at this point), and it would be possible to categorize a person by a suite of polymorphisms based on sequence rather than band migration. Due to their nature, SNPs would probably be easiest. If you pick enough, you can nearly always determine identity. To 100% determine parentage, you may need a sampling from both sides of the family (mother and father) as well as the child.
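To make the “both sides of the family” point concrete, here is a small sketch of a trio consistency check, assuming genotypes are stored as hypothetical allele pairs (0/1 at a biallelic site); the function name and data are illustrative only:

```python
# Trio check: at each SNP, one of the child's alleles should be
# explainable as coming from the mother and the other from the father.

def trio_consistent(child_gt, mother_gt, father_gt):
    """True if the child's genotype at one SNP is Mendelian-consistent
    with the mother/father genotypes."""
    a, b = child_gt
    return ((a in mother_gt and b in father_gt) or
            (b in mother_gt and a in father_gt))

child  = [(0, 1), (1, 1), (0, 0)]
mother = [(0, 0), (0, 1), (0, 1)]
father = [(1, 1), (1, 1), (0, 0)]

mismatches = sum(
    not trio_consistent(c, m, f)
    for c, m, f in zip(child, mother, father)
)
print(f"{mismatches} inconsistent SNPs out of {len(child)}")
```

With only one candidate parent, a handful of binary SNPs can exclude a relationship but rarely prove it; having both parents (or many more markers) tightens the call considerably.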

It is certainly possible to establish a database to identify people based on their SNPs. I don’t imagine it would be very complicated, nor do I imagine that it would be tremendously expensive. At first it could start as a database for felons and for parents who want to establish records of their children, much like the fingerprint and dental databases now around. Binary SNPs could be established without direct sequencing, using PCR based on allele-specific oligonucleotides (ASOs), at a few cents a reaction. Let’s say one categorizes 50 SNPs per person, which would probably be sufficient to establish certainty to within one in a billion. I can imagine one could get the cost down to about ten bucks total per person.
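The cost guess above works out roughly like this (both the per-reaction price and the overhead figure are assumptions for illustration):

```python
# Rough cost arithmetic behind the "ten bucks per person" estimate.
snps_per_person = 50
cost_per_aso_pcr = 0.10        # assumed: a few cents per ASO-PCR reaction
reagents = snps_per_person * cost_per_aso_pcr
overhead = 5.00                # assumed: sample handling, data entry, etc.
print(f"~${reagents + overhead:.2f} per person")   # ~$10.00
```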

This ignores the manifold bioethical problems. Certain SNPs are being associated with disease-prone haplotypes. Putting SNP status on record could be used against people – if you are 15% more cancer-prone according to a haplotype given by a couple of base changes in your DNA, there are good economic arguments for insurance companies to want to limit your coverage. I can imagine tremendous resistance to such a database. Unlike fingerprinting and dental records, more direct disease and health correlations can be made from some SNPs. Arguments along this route are extremely good ones against such a database, and IMHO heavily outweigh arguments for it.

Yes, but . . .
given the amount of info contained in DNA, and the population of the country, the parents (and probably the child, too) would likely have died of old age before a computer finished searching such a database!

Additionally, the legal trend these days toward protecting medical privacy would make this extremely unlikely. There are very strict regulations in place regarding the protection of patients’ medical information – who exactly can see what, and so on. Before anyone could be added to the hypothetical database, they’d have to give legal informed consent. In the end, the only people in the database would be adoptees and parents who want to be found, which I suspect would be a fairly small segment of the population.

Let me give a real-world example:

I’m working on hunting down possible “cryptic” genes in a known sequence a mere 300,000 or so bases long. I got a parallel supercomputer to puke on the problem when I tried a simple “search” approach, and had to break the job down before it would work.

Thus, not only are we short of basic information, but we also lack the computing power at all but the best facilities to even make a reasonable attempt, and even then it would take somebody with experience in such problems to do it efficiently.

What programs are you using, Dogface? I have partially annotated around 850 kb of DNA that I sequenced and ran through the Drosophila 3.1 pipeline on an old-school 400 MHz G4. It took a few hours to do it all – gene prediction, repeat masking, the whole shebang.

I have converted this and other sequence into BLAST libraries, and searching against it takes only a few seconds on the same computer. Searching against the whole Drosophila genome, done locally, usually takes around 30 seconds (120 Mb or so of sequence). Even some truly inefficient Perl programs I wrote to do intersective BLAST searches between homologs of the same genes in four different species take around 20 minutes per gene, and that is when I am using greater than 2 megabases of sequence at a time.
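For anyone who wants to try this kind of local search today, here is a minimal sketch using the modern NCBI BLAST+ command-line tools (the successor to the blastall-era programs); it assumes BLAST+ is installed and that the FASTA file names are your own:

```python
# Build a local BLAST database from a FASTA file and search a query
# against it. File names ("my_sequences.fasta", "query.fasta") are
# placeholders for your own data.
import subprocess

# Create a nucleotide database from the sequences.
subprocess.run(
    ["makeblastdb", "-in", "my_sequences.fasta",
     "-dbtype", "nucl", "-out", "my_db"],
    check=True,
)

# Run blastn with tabular output (-outfmt 6), which is easy to parse:
# query id, subject id, % identity, alignment length, mismatches,
# gap opens, query start/end, subject start/end, e-value, bit score.
subprocess.run(
    ["blastn", "-query", "query.fasta", "-db", "my_db",
     "-outfmt", "6", "-out", "hits.tsv"],
    check=True,
)
```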

A simple directory of SNP status could be very small. If we are just talking about binary SNPs, then one could design it to have only a few bytes per entry. These could be organized hierarchically. Let’s say we had a few hundred million entries. Bet on a few gigabytes total for the database size. That’s approximately the size of the human genome data set, and a fraction of the size of the NCBI nr BLAST data set. Especially with increased computing power due to Moore’s law, I can’t imagine searches would take more than a few minutes, maximum.
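The “few bytes per entry” arithmetic, under the stated assumption of 50 binary SNPs per person plus a small numeric identifier, comes out like this:

```python
# Storage estimate for a bit-packed binary-SNP directory.
SNPS_PER_PERSON = 50
snp_bytes = (SNPS_PER_PERSON + 7) // 8   # 50 binary calls bit-packed -> 7 bytes
id_bytes = 8                             # assumed 64-bit person identifier
record_bytes = snp_bytes + id_bytes      # 15 bytes per entry

for n_people in (100e6, 300e6):
    total_gb = n_people * record_bytes / 1e9
    print(f"{n_people:,.0f} people -> ~{total_gb:.1f} GB")
```

That lands in the 1.5–4.5 GB range – small enough that even a brute-force linear scan finishes in seconds to minutes on ordinary hardware.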

Who wants the entire sequence on their computer?

http://www.ibiblio.org/gutenberg/cgi-bin/sdb/t9.cgi

It’s utterly massive, but it looks like it’s all there.

Search on “Human Genome”.

FrameSearch against the SwissProt database, with too few processors. I now have more allocated resources; at the time I got a Sun cluster to gag on the task.

Is it necessarily a good idea to do the search on computer? Couldn’t one do a “search” in vitro, by laying out half-ladder strands of the DNA to be searched, synthesizing the complement of the sequence we want to find (tagged with radioisotopes), and seeing what it bonds to? There would be a fair amount of time and money spent on the overhead to do this, but once set up, you should be able to search millions of strands very quickly.

That would be fine, if I were looking for proteins that bound to the DNA. I’m not looking for that. I’m looking for cryptic proteins within introns. It’s cheaper to start with computer models than lay out a lot of money for a lot of plates and the extra personnel (or a robot) to do the experiments. Go for the more likely predictions first–it saves grant money.

Chronos
That type of study is called chromatin immunoprecipitation (ChIP). The problem is that with any kind of immunoprecipitation, you have to jump through quite a lot of hoops to prove that something is actually binding and that it is not an artifact of your reaction conditions.

I don’t know if ChIP has ever been done en masse like that, but there is a conceptually similar type of reaction called SELEX, in which a bunch of different DNAs are amplified by PCR, then applied to a binding test (usually run over protein affixed to a column). The DNAs which stick are used for another amplification, then another binding assay. Repeat until you have a small pool of DNA, and that is what binds to said protein. Since there are usually many different DNA bits that bind a particular protein, you will have to sequence a bunch of these to determine range and binding affinity.
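A toy simulation of that select-and-amplify cycle, with “binding” modeled crudely as containing an arbitrary made-up motif and all pool sizes chosen purely for illustration, shows how quickly the pool gets enriched:

```python
import random

# Toy SELEX: start with a random DNA pool, keep the sequences that "bind"
# (contain the motif) plus a little nonspecific carry-over, re-amplify,
# and repeat. The motif and sizes are invented for this sketch.

random.seed(0)
MOTIF = "GGATC"
POOL_SIZE = 10_000

def random_seq(n=20):
    return "".join(random.choice("ACGT") for _ in range(n))

pool = [random_seq() for _ in range(POOL_SIZE)]

for rnd in range(1, 5):
    bound = [s for s in pool if MOTIF in s]              # "binding assay"
    carryover = random.sample(pool, k=POOL_SIZE // 100)  # nonspecific sticking
    survivors = bound + carryover
    pool = random.choices(survivors, k=POOL_SIZE)        # "PCR amplification"
    frac = sum(MOTIF in s for s in pool) / POOL_SIZE
    print(f"round {rnd}: {frac:.1%} of the pool contains the motif")
```

After a handful of rounds the pool is dominated by motif-containing sequences, which is why, as noted above, you then sequence a sample of the final pool to see what actually binds and how well.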