So I installed SpamBayes, which as far as I can tell is the only freeware Bayesian spam filter that works as an Outlook plugin and doesn’t require you to have access to your mail server or be a network administrator. It also gets good reviews, and apparently a lot of work went into it, so despite the title here, I’m not trying to bash it. I just want to understand whether there’s some flaw in the program or whether I’ve trained it improperly.
When I first went looking for a Bayesian “learning” filter, I was drawn by the promise of iterative learning and the alleged superiority over static rule-based filters (my company runs SpamAssassin, which seems to produce an awful lot of false negatives). I got excited by Paul Graham’s article on Bayesian filtering, in which he claimed that only 4 of about 1750 spams got past his filter after training, for a false-negative rate of roughly 0.25% (with a very favorable false-positive rate too).
Well . . . after almost three months, training on a corpus of about 200 spams, and with the sensitivity set to 90% (the score threshold a message must reach to be rejected as spam), I’m achieving nothing like Graham’s results with SpamBayes release 1.0rc2. Instead, I estimate I am getting about 50% false negatives, i.e., at least half of my total incoming spam misses both the “Junk” and “Junk Suspect” folders and lands in my inbox.
What’s even more worrisome is that when I check some of these spams, they often show a “spam clue” score of only 10% to 20%, even after I’ve trained the filter on very similar spams. And the ones that are sneaking by are not fiendishly clever variations; they are often still the ones that contain the words “medication” or “cheap software” or “perscription,” even though almost none of my ham contains these words, whereas multiple quarantined spams do.
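To make it concrete why those low scores surprise me, here is a minimal sketch of how I understand a Graham-style filter to combine per-word probabilities into one message score. It is not SpamBayes’s actual scoring code, which as far as I know uses a different combining scheme. The function names, the word counts, and the corpus sizes (200 spams, 1000 hams) are all hypothetical, chosen only for illustration.

```python
from math import prod

def token_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Graham-style spam probability for one token (illustrative sketch)."""
    b = spam_counts.get(token, 0)
    g = 2 * ham_counts.get(token, 0)  # ham counts doubled to bias against false positives
    if b + g < 5:
        return 0.4  # token seen too rarely to trust; treated as mildly innocent
    bs = min(1.0, b / n_spam)
    gs = min(1.0, g / n_ham)
    return max(0.01, min(0.99, bs / (bs + gs)))

def combined_score(tokens, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    """Combine the top_n most extreme token probabilities into one score."""
    probs = [token_prob(t, spam_counts, ham_counts, n_spam, n_ham)
             for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    s = prod(top)
    h = prod(1 - p for p in top)
    return s / (s + h)

# Hypothetical training counts, invented for this example only.
spam_counts = {"medication": 40, "cheap": 30, "software": 25}
ham_counts = {"meeting": 50, "report": 40, "software": 5}
N_SPAM, N_HAM = 200, 1000

plain = combined_score(["cheap", "medication"],
                       spam_counts, ham_counts, N_SPAM, N_HAM)
padded = combined_score(["cheap", "medication", "meeting", "report"],
                        spam_counts, ham_counts, N_SPAM, N_HAM)
print(f"plain spam: {plain:.4f}, padded with ham words: {padded:.4f}")
```

Under these made-up counts, two strongly spammy words alone push the score far above a 90% cutoff, while padding the same message with two strongly hammy words drags it back to about 50%. A word the filter has never seen only gets the neutral 0.4, so in this simplified model it takes words that actually appear in my ham, not just any random text, to pull a score down that far.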
Am I doing something wrong? Is my spam corpus simply not big enough for effective training to have taken hold? If you’re going to suggest setting my sensitivity threshold lower than 90% . . . I would, except that I already have a somewhat worrisome false-positive rate (maybe one message a day getting misrouted into “Junk” or “Junk Suspect”), and I know that rate would increase if I dramatically lowered the cutoff. Or are the stupid spammer tactics of including random text really working?
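The trade-off behind that cutoff can be sketched in a few lines. This is only an illustration: the scores below are invented, not taken from my mailbox, and `error_rates` is a hypothetical helper, not anything SpamBayes provides.

```python
def error_rates(scored, cutoff):
    """Return (false_negative_rate, false_positive_rate) at a given cutoff.

    scored is a list of (score, is_spam) pairs; a message is treated as
    spam when its score is at or above the cutoff.
    """
    spam = [s for s, is_spam in scored if is_spam]
    ham = [s for s, is_spam in scored if not is_spam]
    fn = sum(1 for s in spam if s < cutoff)   # spam that reaches the inbox
    fp = sum(1 for s in ham if s >= cutoff)   # ham that gets junked
    return fn / len(spam), fp / len(ham)

# Hypothetical scores, made up purely to show the shape of the trade-off.
scored = [(0.97, True), (0.92, True), (0.85, True), (0.15, True),
          (0.02, False), (0.05, False), (0.30, False), (0.88, False)]

for cutoff in (0.90, 0.80, 0.50):
    fn_rate, fp_rate = error_rates(scored, cutoff)
    print(f"cutoff {cutoff:.2f}: false negatives {fn_rate:.0%}, "
          f"false positives {fp_rate:.0%}")
```

With these invented numbers, dropping the cutoff from 90% to 80% catches one more spam but also junks one ham, and no reasonable cutoff catches the spam that scores 15%. That last case is the one that worries me: if spams are scoring 10% to 20%, the cutoff is not really the problem.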
Any other suggestions?