Bayesian spam filters: good idea?

Bayesian spam filters are smart. They dump spam into the trash can so you don’t have to. They’re a lot less obtrusive than challenge-response filtering. They’re easier to manage than blacklists or whitelists. They’re good enough to be built into email clients such as Thunderbird, and for Outlook users, free add-on filters such as SpamBayes get rave reviews.

So what’s not to like?

Well, it seems to me that in order for Bayesian filters to work, they need to read each message to determine whether it is spam or ham, right? Given that some spam messages contain web bugs, doesn’t the process of filtering them then allow these bugs to validate your address – and therefore lead to more spam? Do I misunderstand how these filters work, or is that a huge flaw?

Just reading a message does not do any harm: “It’s just ones and zeros.” The problems start when your mail program tries to interpret the message, or part of it (e.g. an attachment), as more than just text. When a mail client displays an attached document, macros might be executed. The mail client might also contain bugs and do things it shouldn’t while it is rendering a message. On a modern system, where countless programs are involved, many components are linked to each other, and many things happen automatically, there are numerous potential dangers.
However, just reading a message as a sequence of bytes or characters is harmless.
BTW, the same goes for files and viruses.

Web bugs do their damage when an HTML-enabled program fetches the bug from the remote server. Spam filters don’t do this, so they can’t trigger the bug.
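To make that concrete, here is a minimal Python sketch of how a filter might turn a message into tokens. It is not how any particular filter does it; the sample message, the tracker URL, and the `tokenize` helper are all made up for illustration. The point is that the `<img>` tag is treated as ordinary text, and nothing is ever requested from the remote server.

```python
import re
from email import message_from_string

# Hypothetical spam message containing a web bug (a 1x1 remote image).
RAW_MESSAGE = """\
From: sender@example.com
To: you@example.com
Subject: Cheap pills
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"

<html><body>
<p>Buy cheap pills now!</p>
<img src="http://tracker.example.com/bug.gif?id=you@example.com" width="1" height="1">
</body></html>
"""


def tokenize(raw):
    """Split a raw message into lowercase tokens without rendering it."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # the HTML source, as a plain string
    text = msg.get("Subject", "") + " " + body
    # Pure string processing: no HTML rendering, no network access.
    return re.findall(r"[a-z0-9.]+", text.lower())


tokens = tokenize(RAW_MESSAGE)
print(tokens)
```

The tracker's host name ends up as just another token in the list. If anything, it becomes useful evidence that the message is spam.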

BTW I have used POPFile and also Opera’s learning spam filter (which I presume is Bayesian) and they both work great (especially POPFile).

These “web bugs” that you mention are small images that get loaded from a remote server when your HTML mail client displays the message. I don’t view any messages as HTML, but any decent mail client should offer you a setting to not load remote images, and you should have that turned on.

And this doesn’t affect the Bayesian filter anyway, because the filter won’t try to load those images; it will just look at the contents of what got sent.
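For anyone curious what "looking at the contents" amounts to, here is a rough Python sketch of a plain naive Bayes score over word tokens. Real filters such as SpamBayes or POPFile use more refined tokenizing and scoring, so treat this only as an illustration of the idea. The training data, the `tracker.example.com` token, and the function names are invented for the example.

```python
import math
from collections import Counter


def train(messages):
    """Count how often each token appears across a list of token lists."""
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return counts


def spam_probability(tokens, spam_counts, ham_counts, n_spam, n_ham):
    """Naive Bayes with add-one smoothing, computed in log space."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    # Start from the prior: the share of spam vs. ham in the training set.
    log_spam = math.log(n_spam / (n_spam + n_ham))
    log_ham = math.log(n_ham / (n_spam + n_ham))
    for tok in tokens:
        log_spam += math.log((spam_counts[tok] + 1) / (spam_total + vocab_size))
        log_ham += math.log((ham_counts[tok] + 1) / (ham_total + vocab_size))
    # Convert the two log scores back into P(spam | tokens).
    return 1 / (1 + math.exp(log_ham - log_spam))


# Tiny made-up training set; each message is already a list of tokens.
spam_training = [["cheap", "pills", "now"],
                 ["cheap", "watches", "tracker.example.com"]]
ham_training = [["meeting", "notes", "attached"],
                ["lunch", "tomorrow"]]
spam_counts = train(spam_training)
ham_counts = train(ham_training)

p_spam = spam_probability(["cheap", "pills", "tracker.example.com"],
                          spam_counts, ham_counts, 2, 2)
p_ham = spam_probability(["meeting", "tomorrow"],
                         spam_counts, ham_counts, 2, 2)
print(p_spam, p_ham)
```

Everything here is counting and arithmetic on strings. A URL inside the message is only ever compared against the token counts; it is never opened.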