I've used SpamAssassin for quite a while, but over time it seemed to become less and less effective at filtering spam. I considered why this might be and thought that maybe the spammers were just outsmarting it: I saw a lot of scores just under 5, and many of the messages were foreign-language spam with different character sets. I recently got into a feature that was new to me: sa-learn. It is a way of telling spamassassin which email you consider spam and which you consider ham. To be effective, you must have built up quite a collection of each type.
These commands can be run on your existing mailbox(es) to teach spamassassin how to separate the ham from the spam.
Examples of Bayesian training
These assume an mbox format.
sa-learn --spam --no-sync --showdots --local --mbox ~mark/imap/SpamTrap
sa-learn --spam --no-sync --showdots --local --mbox ~mark/imap/SpamActual
sa-learn --ham --no-sync --showdots --local --mbox ~mark/imap/2005
sa-learn --ham --no-sync --showdots --local --mbox ~mark/imap/2004
sa-learn --ham --no-sync --showdots --local --mbox ~mark/imap/2003
sa-learn --ham --no-sync --showdots --local --mbox ~mark/imap/2002
sa-learn --sync
Use man sa-learn to find out more.
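The repeated training commands could also be wrapped in a small loop. This is only a sketch: the function name train_mboxes and the SA_LEARN override variable are made up for illustration, while the sa-learn flags are the same ones used in the commands above.

```shell
#!/bin/sh
# Sketch: feed several mbox files to sa-learn as either spam or ham.
# train_mboxes and SA_LEARN are hypothetical names, not part of SpamAssassin.
# Setting SA_LEARN=echo prints the arguments instead of running sa-learn.

train_mboxes() {
    kind=$1    # must be "spam" or "ham"
    shift
    case $kind in
        spam|ham) ;;
        *) echo "usage: train_mboxes spam|ham MBOX..." >&2; return 2 ;;
    esac
    for mbox in "$@"; do
        if [ -r "$mbox" ]; then
            "${SA_LEARN:-sa-learn}" "--$kind" --no-sync --showdots --local --mbox "$mbox"
        else
            # Unreadable files are skipped rather than aborting the whole run.
            echo "skipping unreadable mbox: $mbox" >&2
        fi
    done
}

# Example (paths as in the commands above):
#   train_mboxes spam ~mark/imap/SpamTrap ~mark/imap/SpamActual
#   train_mboxes ham  ~mark/imap/2005 ~mark/imap/2004 ~mark/imap/2003 ~mark/imap/2002
#   sa-learn --sync
```

Because every call uses --no-sync, a final sa-learn --sync is still needed once all mailboxes have been processed.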
At first, doing this didn't help my spam problem because, as it turned out, spamd was being run under a different user than myself (mark). Amavisd runs spamassassin, so I used pstree -aup to find out which user id was running it. It was amavis, so I ran the same sa-learn commands again, but this time as amavis.
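For a quicker check than reading a process tree, something like the following could list the owning users directly. It is a sketch under assumptions: process_owners is a hypothetical helper, and the process names it matches (anything containing spamd or amavis) may differ on your system.

```shell
#!/bin/sh
# Sketch: print the distinct users that own spamd or amavisd processes.
# process_owners is a hypothetical helper. It reads "USER COMMAND" lines
# on stdin so that it can be fed from ps or from a saved listing.

process_owners() {
    # $1 is the user, $2 the (first word of the) command name.
    awk '$2 ~ /spamd|amavis/ { print $1 }' | sort -u
}

# Usage:
#   ps -e -o user= -o comm= | process_owners
```

Whatever user this prints is the one whose Bayes database the running filter consults, so that is the user to run sa-learn as.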
Here are the message statistics from each mailbox, which I believe put me into the sweet spot for bayesian effectiveness (based on the sa-learn man page).
- SpamTrap - 2273 message(s)
- SpamActual - 15 message(s)
- 2005 - 436 message(s)
- 2004 - 1368 message(s)
- 2003 - 2286 message(s)
- 2002 - 711 message(s)
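Counts like these can be approximated by counting the "From " separator lines in each mbox file. The helper names count_mbox and mbox_report below are hypothetical, and the count assumes that body lines starting with "From " have been escaped (usually as ">From "), which most mbox writers do.

```shell
#!/bin/sh
# Sketch: count messages in mbox files by their "From " separator lines.
# count_mbox and mbox_report are hypothetical helper names.

count_mbox() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c '^From ' "$1"
}

mbox_report() {
    for f in "$@"; do
        printf -- '- %s - %s message(s)\n' "$(basename "$f")" "$(count_mbox "$f")"
    done
}

# Usage:
#   mbox_report ~mark/imap/SpamTrap ~mark/imap/2005
```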
Running sa-learn --sync produced this output...
expired old Bayes database entries in 82 seconds
126481 entries kept, 81987 deleted
token frequency: 1-occurence tokens: 55.12%
token frequency: less than 8 occurrences: 31.14%
Well, based on what I have seen in /etc/cron.daily/amavisd-new, it appears the bayesian database may need to be built and owned by the amavis user. So the commands I should use when running spamc/spamd in conjunction with amavisd would seem to be...
su amavis -c "sa-learn --spam --no-sync --progress --local --mbox /tmp/Spam*"
su amavis -c "sa-learn --ham --no-sync --progress --local --mbox /tmp/200"
su amavis -c "sa-learn --sync"
Note that I had to copy my personal mboxes into /tmp and widen the perms for amavis to read them. *Sigh*
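The copy step could be scripted so the widened permissions apply only to throwaway copies in their own directory. This is a sketch, not a hardened setup: stage_mboxes is a hypothetical helper, and the copies are still world-readable until the directory is removed.

```shell
#!/bin/sh
# Sketch: copy mbox files into a fresh temporary directory that another
# user (such as amavis) can read, and print the directory name.
# stage_mboxes is a hypothetical helper name.

stage_mboxes() {
    dir=$(mktemp -d "${TMPDIR:-/tmp}/sa-train.XXXXXX") || return 1
    chmod 755 "$dir"                      # let other users enter the directory
    for src in "$@"; do
        cp "$src" "$dir/" || return 1
        chmod 644 "$dir/$(basename "$src")"   # readable copy; original untouched
    done
    echo "$dir"
}

# Usage:
#   d=$(stage_mboxes ~mark/imap/SpamTrap ~mark/imap/2005)
#   ... run sa-learn as amavis against the files in "$d" ...
#   rm -rf "$d"
```

Removing the directory afterwards keeps personal mail from lingering in /tmp.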
Now I am getting a strange error.
su amavis -c "sa-learn --ham --no-sync --showdots --local --mbox /tmp/2006"
bayes: bayes db version 0 is not able to be used, aborting! at /usr/share/perl5/Mail/SpamAssassin/BayesStore/DBM.pm line 160.
This has happened twice now. The problem seems to go away after I keep trying the command in rotation with:
sa-learn -D --sync
This is still a mystery though.
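One thing that might help narrow this down is sa-learn --dump magic, which prints the database's non-token data, including the "bayes db version" and the nspam and nham counters. The sketch below pulls a single value out of that output. bayes_magic is a hypothetical helper, and it assumes the usual layout in which the value is the third column and the label follows "non-token data: ". It does not explain the error; it only shows what the database reports for a given user.

```shell
#!/bin/sh
# Sketch: extract one value from "sa-learn --dump magic" output on stdin.
# bayes_magic is a hypothetical helper name. Assumed line layout:
#   0.000          0          3          0  non-token data: bayes db version

bayes_magic() {
    # $1 is the label to look up, e.g. "bayes db version", "nspam" or "nham".
    awk -v label="$1" '
        {
            prefix = "non-token data: "
            i = index($0, prefix)
            # Print the third column when the text after the prefix is the label.
            if (i > 0 && substr($0, i + length(prefix)) == label) print $3
        }'
}

# Usage:
#   su amavis -c "sa-learn --dump magic" | bayes_magic "bayes db version"
#   su amavis -c "sa-learn --dump magic" | bayes_magic nspam
```

Comparing the values seen as amavis with those seen as your own user would at least confirm whether both are looking at the same database.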
More and more, image spam is leaking through. The spammers have figured out how to put their spam message into an embedded image. The text in the image is readable, but blocky and runs together. There is also a significant amount of noise and/or anomalies in the images to defeat OCR programs. Finally, these types of spam almost always contain a bunch of seemingly random text designed to defeat bayesian filters.
I have looked into gocr and tested it on a few of the images coming through, and reached the conclusion that the spammers are winning this particular battle. There is the FuzzyOCR plugin for SA, but I have not tried it yet. I hope to soon.
References and resources related to this topic.