How are endogenous retroviruses identified in DNA?

Well, that’s my question. Having read about endogenous retroviruses at talkorigins.org, I found the topic very interesting.

However, I was wondering how they are discovered. Suppose you sequence a section of DNA – how do you know which sequences are retroviruses and which are random junk or other unidentified material?

Just compare the sequence to known retroviruses. Genbank has a special page just for them: http://www.ncbi.nlm.nih.gov/retroviruses/

I suspect that after millions of years of evolution, it’s likely that none of the ancient retroviruses are identical to the ones that are active today. So how are they identified? Are there numerous similarities between the ancient and modern viruses that make their identification certain? What are these criteria?

There are several algorithms scientists use to compare sequences. One of the most common and widely accepted is called BLAST (Basic Local Alignment Search Tool). There are many variations, but in general the algorithm compares a given query sequence (such as a retroviral sequence) with a database of sequences (such as the human genome). Statistics indicate how similar the query sequence is to the target sequence (e.g., 98% similarity) and how likely it is that a random sequence would match the query sequence that well. The latter statistic is arguably more important. As the query sequence becomes longer and more specific, the probability that a random sequence of nucleotides will match that sequence becomes smaller.
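
To get a feel for why longer queries are more convincing, here is a rough Python sketch. It assumes every base is equally likely and independent of its neighbours (real genomes aren't like that), it only counts exact matches, and it is not the statistic BLAST itself reports. The function names are made up for illustration.

```python
def exact_match_probability(query_length: int) -> float:
    """Chance that one random stretch of DNA exactly matches a query of this length.

    Assumption: each of the 4 bases is equally likely and independent.
    """
    return 0.25 ** query_length


def expected_chance_hits(query_length: int, genome_length: int) -> float:
    """Expected number of exact matches arising purely by chance in a random genome."""
    # Number of start positions where the query could fit in the genome.
    positions = max(genome_length - query_length + 1, 0)
    return positions * exact_match_probability(query_length)


if __name__ == "__main__":
    genome_length = 3_000_000_000  # roughly the size of the human genome, in bases
    for n in (10, 20, 50):
        print(n, expected_chance_hits(n, genome_length))
```

Under these assumptions, a 10-base query would be expected to turn up thousands of times by chance in a genome of that size, while a 20-base query is expected well under once. A full retroviral sequence is thousands of bases long, so a close match is very unlikely to be a coincidence.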

You can even do this search yourself:

[ol]
[li]Take a sequence such as this one for a human endogenous retrovirus: Human endogenous retrovirus H HERV-H/env59 proviral copy, clone 916F3 - Nucleotide - NCBI. You’ll have to scroll down a bit to see the actual sequence itself.[/li]
[li]See the line that says “ACCESSION AJ289711”? Copy the sequence accession number “AJ289711”.[/li]
[li]Go to the NCBI BLAST page: http://www.ncbi.nlm.nih.gov/blast/index.shtml. On the right-hand side there’s a box for “Genomes”. Click on “Human” to search the human genome. Paste the sequence accession number that you copied in step 2 into the big box that says “Enter an accession, gi, or a sequence in FASTA format”. Leave all the other options at default.[/li]
[li]Click on “Begin Search”. The BLAST server will do its thing. Depending on how busy it is, you may have to wait several minutes (the SDMB hamsters have nothing on the BLAST hamsters). Click on “Format!” to see your results.[/li][/ol]

If you scroll down a bit, you’ll see how the query sequence (i.e., the retrovirus) compares with the human genome. Note that it’s not an exact match but it’s really, really close (99% similarity). Now I cheated a bit because I started with an endogenous retrovirus already known to be in the human genome to ensure getting a hit. Researchers starting from scratch wouldn’t necessarily know whether their sequence was in the genome. You may wish to compare the results you get with other viruses, such as this bird retrovirus: Avian sarcoma virus CT10 genomic sequence - Nucleotide - NCBI
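
The “99%” figure is basically the share of aligned positions where the two sequences carry the same base. Here is a minimal Python sketch of that idea, assuming the two sequences have already been lined up to the same length with “-” marking gaps. BLAST does the alignment for you and its report may differ in detail; `percent_identity` is a name invented here, not part of any BLAST tool.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where both sequences have the same base.

    Assumption: inputs are already aligned, equal in length, with '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    # Count columns with identical bases; a gap never counts as a match.
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Toy sequences, not real retroviral data.
    print(percent_identity("ACGTACGT", "ACGTACGA"))
```

An ancient retrovirus that has picked up mutations over time would score below 100 against its modern relatives, but still far above what unrelated sequences of that length would manage.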

See also this list of sequence alignment software from Wikipedia.