Do I Have The Stupidest Bayesian Spam Filter, Ever?

So I installed SpamBayes, which as far as I can tell is the only freeware Bayesian spam filter that works as an Outlook plugin and doesn’t require you to have access to your mailserver or be a network administrator. It also gets good reviews, and apparently a lot of work went into it, so despite the title here, I’m not trying to bash – just trying to understand whether there’s some flaw in the program, or whether I’ve trained it improperly.

Because when I first went looking for a Bayesian “learning” filter, I was drawn by the premise of iterative learning and the alleged superiority over static rule-based filters (my company runs SpamAssassin, which seems to have an awful lot of false negatives). I got excited by Paul Graham’s article on Bayesian filtering, in which he claimed that only 4 of about 1750 spams made it past his filter after training, for a false-negative rate of about .25% (with very favorable false-positive rates too).
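
For reference, the combining rule from Graham’s article, as best I understand it, looks roughly like this – a minimal Python sketch, with the per-token probabilities assumed to be precomputed from ham/spam counts (the values below are illustrative, not from any real corpus):

# A minimal sketch of the combining rule from "A Plan for Spam".
# Token probabilities are assumed to be precomputed; the inputs
# below are made up for illustration.
def combined_spam_prob(token_probs, n=15):
    """Combine the n token probabilities farthest from neutral (0.5)."""
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    prod_p, prod_not_p = 1.0, 1.0
    for p in interesting:
        prod_p *= p
        prod_not_p *= 1.0 - p
    return prod_p / (prod_p + prod_not_p)

# A few strong spam tokens dominate many neutral ones:
print(combined_spam_prob([0.99, 0.98, 0.95, 0.5, 0.5, 0.5]))  # ~0.99999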

Well . . . after almost three months, and training on a corpus of about 200 spams, with the sensitivity set to 90% (the score threshold a message must exceed to be rejected as spam) – I’m achieving nothing like Graham’s results with SpamBayes release 1.0rc2. Instead, I think I am getting about 50% false negatives, i.e., at least half of my total incoming spam gets past the “Junk” or even “Junk Suspect” folder and into my inbox.
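
(For clarity on my setup: as I understand it, the routing actually works on two cutoffs, with an “unsure” band in between – something like the following sketch, where 0.9 is my setting and the 0.2 lower cutoff is my guess at the default, not a verified value.)

# How I understand the three-way routing (a sketch; the lower cutoff
# of 0.2 is an assumed default, 0.9 is my own setting):
def route(score, ham_cutoff=0.2, spam_cutoff=0.9):
    if score >= spam_cutoff:
        return "Junk"
    if score >= ham_cutoff:
        return "Junk Suspect"
    return "Inbox"

print(route(0.15))  # Inbox
print(route(0.55))  # Junk Suspect
print(route(0.95))  # Junk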

What’s even more worrisome is that when I check some of these spams, they often show a “spam clue” score of only 10%-20%, even after I’ve trained the filter on very similar spams. And the ones that are sneaking by are not fiendishly clever variations; they are often still the ones that contain the words “medication” or “cheap software” or “perscription,” despite the fact that none or almost none of my ham contains these words, whereas multiple quarantined spams do.

Am I doing something wrong? Is my spam corpus simply not big enough for effective training to have taken hold? If you’re going to suggest setting my sensitivity threshold lower than 90% . . . I would, except that I already have a somewhat-worrisome false-positive rate (maybe one message a day getting misrouted into “Junk” or “Junk Suspect”), and I know that would increase if I dramatically lowered the cutoff. Or are the stupid spammer tactics of including random text really working?

Any other suggestions?

A few questions: 200 sounds a bit low for fully effective training. That would be about a day’s worth for my accounts. Do you really get that little spam in three months?

You say “training on a corpus” – do you somehow feed those samples to your filter manually, possibly more than once? If you do, don’t do that unless you are very sure of what you are doing. With a statistical classifier, keep your probabilities “natural” and avoid cheating the system. It is very easy to create a bias in the classifier if you fiddle with the training data.
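
A toy illustration of the kind of bias I mean, with made-up counts (the raw, unsmoothed token probability here is just for illustration, not any particular filter’s formula):

# Feeding the same spam to the trainer twice doubles its token counts,
# which exaggerates how spammy those tokens look. Made-up numbers.
def raw_prob(spam_count, ham_count, nspam, nham):
    sr = spam_count / nspam if nspam else 0.0
    hr = ham_count / nham if nham else 0.0
    return sr / (sr + hr) if (sr + hr) else 0.5

# A token seen once in 100 spams and once in 100 hams is neutral:
print(raw_prob(1, 1, 100, 100))  # 0.5
# Train on that one spam a second time and it suddenly leans spammy:
print(raw_prob(2, 1, 101, 100))  # ~0.66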

When you get those false negatives, are you using them to further train your filter, or are you just deleting them?

BTW, I am using Popfile, another Bayesian filter, and am getting over 99% accuracy, which for me means about 1 or 2 false negatives per month.

I should have been clearer. I initially ran a training session on a batch basis, with roughly equal numbers of ham and spam (as SpamBayes recommends).

Since then, I’ve been transferring all false negatives from my inbox to my “Junk Messages” folder, and all false positives from my “Junk”/“Junk Suspect” folders to my inbox, and (if I understand the setup correctly and have checked the right options) these transfers should be the basis for further training.

Maybe my corpus is just too small to date (my company’s domain isn’t that well known and my address is pretty obscure, so I guess I’m lucky in getting dozens of spams per week at work, not hundreds – the initial mailserver filtering done by SpamAssassin may also be keeping out the more grossly obvious candidates).

I was just surprised, given how rapidly Graham claimed Bayesian filters could evolve to catch spammer chicanery, that “perscription,” say, hasn’t yet been deemed a 100% definitive spam token. Anyway, I’ll hold out hope that the thing will get smarter as the corpus slowly grows.
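
One possible partial explanation I’ve run across since: these filters apparently smooth rare tokens toward neutral, so a word seen in only a handful of messages can’t reach an extreme score no matter how one-sided the evidence. A rough sketch of that smoothing (the formula is Gary Robinson’s, which I believe SpamBayes uses; the parameter defaults here are my assumption, not verified settings):

# Rough sketch of Robinson-style smoothing of a token's spam
# probability toward 0.5 when the token has rarely been seen.
# The s and x defaults are assumptions, not verified settings.
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    if spam_ratio + ham_ratio == 0:
        p = x
    else:
        p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # the fewer times a token has been seen, the closer it stays to x
    return (s * x + n * p) / (s + n)

# "perscription" in 3 of 200 spams, 0 of 200 hams: strong, not definitive
print(token_spam_prob(3, 0, 200, 200))  # ~0.93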

I use a Bayesian filter, both for myself and for some people who share my domain (which greatly complicates the detection of false positives). Together, I’d say all these accounts get at least 500 spams/day (we’ve had this domain since the days when domains were free). After training it on the previous week’s spam – and NON-spam (very important) – I found I had good accuracy (say 80%) with the filter set to 50%.

I set it deliberately low (50%) because I knew that I couldn’t monitor the junk mailbox indefinitely at these volumes, and I wanted the filter to catch questionable calls early, so that I could instruct it that they were not junk.

Only a small volume of messages scored 50-60%, and that range contained ALL my false positives. After another week, that fraction decreased and didn’t contain any false positives. After a month, there were practically no messages in that range, and I raised the limit to 60% – I got a few more spams through, but by manually screening those few “tough calls” and the junked mail with scores of 60-70%, I quickly trained the system to the point that I felt comfortable that I wouldn’t be losing any legitimate mail.

I think you set your score threshold far too high, far too early.

If I’m reading the FAQ correctly, that folder method only works for the initial training. In day-to-day use, you need to use the “Delete as spam” and “Recover from spam” buttons.

Yeah.

Don’t talk about the ‘stupidity’ of others when you’re still using Outlook.

Yeah – see my disclaimer (and as I’m finding out, I may not even have trained it right; one of the possibilities I’ve considered all along is certainly that my own stupidity is in play).

Agreed, I’d be running some tricked-out Linux box with all non-MSFT stuff if I weren’t wedded to the company platforms to some extent. I’m already in violation of about six separate corporate policies for installing outside software, reconfiguring my firewall settings, changing registry settings, etc., but it’s saved me from a couple of viruses and worms that have propagated through the office . . . .

Thanks – after looking through my menus, though: while you are correct about the default case, there are checkable options in the filtering menu that allow the program to recognize a manual transfer into the spam/ham folders as a training event (just as when you use the “delete as spam” and “recover from spam” buttons), and I did have these checked (the corpus size continues to grow each time I manually transfer messages into the folders, too).

t-bonham@scc.net, your comments are inappropriate for GQ. Kindly refrain from making such comments in the future.

Thank you.

-xash
General Questions Moderator

So MS bashing is off limits in GQ? Or is his sin that he referred to using MS products as stupid?

The insinuation that the OP is stupid for using Outlook.

Stating specific flaws of a particular software or OS is certainly permitted, but MS bashing per se is discouraged in GQ.

Although, in fairness to t-bonham@scc.net, he probably didn’t mean any malice towards the OP. Which is why my post was labelled as “Moderator’s Notes” and not “Moderator’s Warning”.

It’s just a guideline to keep GQ clean for all of us.

-xash
General Questions Moderator

In my experience you should be getting a much better hit rate than 50%. I used SpamBayes for quite some time, and my false-negative rate seemed to be about 5%, with false positives below 1%. I stopped using it about 6 months ago, though, so my experience may be dated. More and more spam is trying to employ “bayes poison” – huge chunks of random words thrown in to take advantage of any hammy scores those words may already have, and also to poison the bayes database so that those same words later count as spam evidence. This might account for the slower-than-expected training, but in the end the filter should be able to cope with most bayes poison.
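
Roughly why the poison mostly fails, as I understand it: tokens the database has never seen score close to neutral, and the classifier skips near-neutral tokens when it picks what to score. A sketch (the 0.1 strength cutoff is an assumed value for illustration, not a verified setting):

# Sketch of the token-selection step: filler words the database has
# never seen score ~0.5 and drop out before combining. The 0.1
# strength cutoff is an assumption for illustration.
def scoring_tokens(token_probs, min_strength=0.1):
    """Keep only tokens informative enough to affect the score."""
    return [p for p in token_probs if abs(p - 0.5) >= min_strength]

# A spam padded with 200 random dictionary words: the filler is
# unknown (~0.5) and vanishes, leaving the real spam evidence.
msg = [0.99, 0.97, 0.95] + [0.5] * 200
print(scoring_tokens(msg))  # [0.99, 0.97, 0.95]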

When I trained my database, I used a corpus of about 10,000 spams and about 1,000 hams (I get a lot more spam than ham). I think your results would be better with a larger initial training run.

I stopped using SpamBayes when my ISP turned on Bayesian filtering in SpamAssassin. Due to the way mail traverses the mail servers, my ISP doesn’t do per-user databases, so the Bayesian filtering is a little less accurate than what you should be able to get with a local database. Combined with the other SpamAssassin scoring, though, it does a great job. If your company is already using SpamAssassin, perhaps you could have them turn on Bayesian filtering along with network tests (RBLs etc.) to decrease the false negatives. Also, if your email admins are open to suggestions, there are many custom rules for SpamAssassin that complement the default ones.

There is an official SpamBayes mailing list where you might have a better chance of getting an answer.

I noticed you misspelled “prescription” in both your posts – WAG, but could this be why it’s still getting through, if you’ve spelled it the same way in your own mail?

Well, my point was that at this juncture, I get a lot more spam spelling it “perscription” than I get ham using either “prescription” or “perscription,” so I’m thinking “perscription” ought to be a pretty strong spam token.

But in any event, while I think you can also define custom rules with Bayesian filters, my basic setup doesn’t rely on my defining any individual tokens – so I never enter any spelling of the word at all. My only action is to designate a given e-mail as being, on the whole, “spam” or “ham.” Then the program ‘decides,’ based on such designations, whether the e-mail’s containing ‘perscription’ or ‘prescription’ makes for a stronger spam or ham token, and how each token should be weighted against dozens of others.
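
In other words, as I understand the mechanics (a rough sketch with a hypothetical data structure, not SpamBayes’s actual internals), a single “spam”/“ham” designation just bumps a counter for every token in the message, and all the per-token weighting falls out of those counters:

# Rough sketch of what one "this is spam"/"this is ham" designation
# does: bump a counter for every token in the message. Hypothetical
# storage; the real database is more involved.
from collections import Counter

spam_counts, ham_counts = Counter(), Counter()
nspam = nham = 0

def train(tokens, is_spam):
    global nspam, nham
    counts = spam_counts if is_spam else ham_counts
    for t in set(tokens):  # count each token once per message
        counts[t] += 1
    if is_spam:
        nspam += 1
    else:
        nham += 1

train("cheap perscription meds online".split(), is_spam=True)
train("your prescription is ready for pickup".split(), is_spam=False)
# per-token spam/ham probabilities are then derived from these counts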