How does antivirus software scan so fast?

They all do it so this can’t be some trade secret. How in hell does a scan go through one file so quickly since I’m thinking its looking for 1000 different patterns.

Heck, it takes Windows forever to search a hard disk for all files with a particular name; one would think that’s an easier task.

Anyone in the know?

Searching in general is something a lot of computer programs have to do, not just antivirus programs. Entire books have been written on algorithms for searching. The programs are probably using data trees. It’s not going to go through and compare each piece of a file to every known virus signature it has. Instead, it’s going to store data in trees, and traverse the tree branches as the data from the file is read.

A tree works like this. Suppose you have four kown “virus” patterns, ABC, ABDF, ACE, and BQX (using letters to represent the patterns - in the computer these would actually be bytes of data). You start with a node and call it A. This has two branches, one that goes AB, and the other goes AC. The AB branch further splits into two branches, one that goes ABC and the other that goes ABDF. The AC path doesn’t branch. The last virus doesn’t fit into the A node, so it’s a seperate branch right from the start.

Now you have a “program” that has the pattern XYZABDFQYB. Instead of comparing this with all four of your different patterns, you start traversing the tree. X isn’t one of your initial branches, so you can just skip it. The same for Y and Z. When you get to the A, now you’ve found a branch. Remember that your virus tree has two branches off of A, one that goes AB and the other that goes AC. You look at the next letter and it is a B, so you go down the AB branch. If you had any letter other than B at this point you could stop searching, because you know the data isn’t in your tree. The AB branch has two branches, one that goes ABC and the other that goes ABDF. The next letter is a D, so now you know you are on the ABDF branch. You look at the next letter. If it’s an F, which it is, then you’ve found your virus. If not, you are done searching and can move on to the next letter in your “program”. You have 13 letters worth of viruses in your database, but in only 7 letter comparisons (if I counted right) you found your virus.

As far as the antivirus pattern matching, it goes like this:

  1. Load list of patterns to match against from disk into RAM.
  2. Pull file #1 into memory, run file #1 against patterns, if no match, forget about file #1. Keep patterns in RAM.
  3. Pull file #2 into memory, run file #2 against patterns, if no match, forget about file #2. Keep patterns in RAM…
  4. Run out of files. Announce results, prompt user for actions.
    Since the pattern stays in RAM after initial load, the expensive part is pulling files into RAM to examine them.

In theory, virus scans should be slightly quicker than a file search, but remember that your CPU is exponentially faster than your disk drive is at transferring and processing data, somewhat minimizing the difference.
Remember that with default antivirus settings, only SOME files get scanned. Many file types are ignored.

Look for the next Windows to have faster searches due to its database-derived file systems. With luck, they’ll index it, and you’ll search a TABLE, not a disk.

Engineer_comp_geek’s explanation is pretty good, but only answered the “How do the do it fast” question. As to why Window’s is so amazingly slow, you’ve got me. The best answer I can give is an off-comment by an ex-Microsoft guy saying that vast portions of Windows was written by interns. Though I would also note that for a file scan, usually you just start it and then go off web browsing or whatever. While as when you want to search for a file, you generally sit there staring at it. I.e. a watched pot never boils.

You might try doing a full system search for a text string inside a file–a string you know won’t be found–and time it. Then you could see if it really takes longer or if it just seems like it.

But indeed, using a tree, a pattern search for a fixed string or several different data patterns shouldn’t be a great difference in speed (though the latter does have a little more overhead–but that won’t be noticable since as Mr. Slant said, you can finish the scan in memory faster than you can suck in the next file off the harddrive.)

And the preeminent one is:
Art of Computer Programming, Volume 3: Sorting and Searching by Donald E. Knuth

Going through one file is very quick you only have to read the disk to get the file once. Looking for a file name on the disk takes a longer time because you have to read many different directory files. You have to move the read head all over the disk etc. It is getting the directoy info that takes long not the search.