How does anti-virus software work?

I tried to look this one up on How Stuff Works but I didn’t see anything. What I want to know is: how does anti-virus software work? I have seen the software in action as it scans every file on my hard drive to look for viruses. Does the software compare each file against a database of known viruses and look for specific strings of code which are indicative of a virus? Does the database (virus definitions) include all or most known viruses going all the way back to when viruses were first known? For example, would an up-to-date version of Norton Anti-virus still catch a file with the Michelangelo virus? (This particular virus got a lot of news coverage in 1992.) It goes through each file so fast that I have to wonder how it can do it so quickly if it has to compare the contents of each file against a huge database of known viruses.

It compares each file to the virus definitions (and yes, most lists go back to the first viruses; that’s why they’re so big). However, for many file types, it only needs to check a small part of each file. One or two lines of code is enough to identify a particular virus.

It is a clunky way to do it. In the beginning, there were two types of antivirus: those with virus definitions, and integrity checkers. Integrity checkers would check if any of your files had been altered and fix the alterations. When tested, they were just as good as those with virus definitions, but they got worse reviews. Why? Because they would say “found an unknown virus and cleaned it” while those with lists would say “found Michelangelo and cleaned it.” Because those with definitions could name the virus, they were deemed better, when actually they were worse: if the virus wasn’t in the definitions, you were out of luck. That’s why you need to update your antivirus definitions.
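To make the integrity-checker idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular product worked: the function names `snapshot` and `changed_files` are made up, and SHA-256 is my own choice of checksum (older tools would have used whatever checksum was practical at the time). It also only detects alterations; it does not try to repair them.

```python
import hashlib
from pathlib import Path

def snapshot(paths):
    """Record a SHA-256 digest for each file while it is known to be clean."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}

def changed_files(baseline):
    """Return the files whose current digest no longer matches the baseline."""
    changed = []
    for name, digest in baseline.items():
        current = hashlib.sha256(Path(name).read_bytes()).hexdigest()
        if current != digest:
            changed.append(name)
    return changed
```

Note that this never needs to know the name of the virus, which is exactly why such a tool can only report "something changed this file".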

I would say that they may eventually drop some old viruses from the definitions, but only very reluctantly. If someone got infected by, say, Stoned while they had antivirus running, the bad press would hurt sales.

I’m sure that the situation has gotten more complex by now, but in the old days the virus-detector programs would examine each executable file looking for tell-tale sequences of hexadecimal characters identified as belonging to specific viruses. You would download Virus-Scanner “update” files periodically to update your virus-scanner program - those update files were actually data files containing the latest set of those character sequences. Unless a particular character sequence was identified as a false alarm it would never be removed from those files, so once a virus sequence was identified it would be looked for forever.
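As a rough sketch of that old-style approach (in Python, purely for illustration): the "update file" boils down to a table of names and byte sequences, and the scanner checks whether any of them appears in a file. The names and hex strings in `SIGNATURES` below are invented placeholders, not real virus signatures, and `scan_bytes` / `scan_file` are hypothetical helper names.

```python
# Hypothetical definitions table: name -> byte sequence.
# These values are made up for the example.
SIGNATURES = {
    "Example.A": bytes.fromhex("deadbeef90"),
    "Example.B": bytes.fromhex("cafebabe0102"),
}

def scan_bytes(data, signatures=SIGNATURES):
    """Return the names of every signature that occurs anywhere in data."""
    return [name for name, sig in signatures.items() if sig in data]

def scan_file(path, signatures=SIGNATURES):
    """Read a file as raw bytes and scan it against the signature table."""
    with open(path, "rb") as f:
        return scan_bytes(f.read(), signatures)
```

This version checks the signatures one at a time, so the work grows with the size of the table, which is the slowness the OP is wondering about.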

…but then viruses got more complicated (they’d specifically be designed to morph as they spread), so things may well have changed.

Hmm, looks like I type far too slowly…

So back to the OP’s question… How does it compare each file on my computer to thousands of definitions, and not take a month to complete?

Computers are fast.

IIRC PC-cillin came back with integrity scanners, along with scanning for virus-like activity, as an attempt to combat new viruses. It worked by stopping any suspicious activity and asking you what you wanted to do. I liked this approach as a backup to a signature search and can see it coming back.

Virus scanners basically look for distinctive strings of bytes called a signature. Each virus has its own signature. In the bad old days of MS-DOS computing, viruses were actually more sophisticated than the lame Word/Excel macro viruses we have now.

Because it doesn’t compare the file against each definition one at a time. There are things called “search trees” which are much faster than individual comparisons. Imagine you sort a bunch of words alphabetically. Then you make “branches” with each letter, so your first branch is every word that starts with “a” and subsequent branches are also sorted by letter. If the word aardvark comes up, you don’t have to compare it to every word in the dictionary. You go down the A branch, then the A branch under that, then the R branch, then the D branch, etc. until you get to the end of the word. If the word is in the tree, then you have a match. If not, then it’s not in the dictionary. Searches can go very quickly: if, for example, you get to the letter V and there is no V branch, then you don’t have to bother searching any further, since you know the word isn’t there.

That’s just one of many types of search trees, but it illustrates the point. Instead of comparing the word “aardvark” with 10,000 dictionary entries, you get at most 8 compares as you search through the tree. Instead of using letters, the virus programs are searching on byte values, but it’s the same basic idea.
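Here is a small Python sketch of the tree described above, keyed on byte values instead of letters. Treat it as an illustration under my own assumptions: `build_trie` and `scan` are made-up names, the nested-dict layout is just one convenient way to represent the branches, and signatures are assumed to be non-empty. I am not claiming any real scanner is written this way.

```python
def build_trie(signatures):
    """Build a nested-dict tree: each key is a byte value leading to a branch.

    The special key None marks the end of a complete signature and stores its name.
    """
    root = {}
    for name, sig in signatures.items():
        node = root
        for byte in sig:
            node = node.setdefault(byte, {})
        node[None] = name
    return root

def scan(data, root):
    """Return (offset, name) for every signature found in data."""
    hits = []
    for start in range(len(data)):
        node = root
        pos = start
        while True:
            if None in node:                      # reached the end of a signature
                hits.append((start, node[None]))
            if pos >= len(data) or data[pos] not in node:
                break                             # no such branch: stop early
            node = node[data[pos]]
            pos += 1
    return hits
```

For example, with `root = build_trie({"A": b"abc", "B": b"abd", "C": b"xy"})`, calling `scan(b"zzabdxy", root)` walks a-b-d and x-y and reports both. The cost at each position depends on how deep the walk gets, not on how many signatures are in the table. This sketch still restarts the walk at every offset; the Aho-Corasick algorithm is a known refinement that avoids those restarts.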