From ConShell


I've used SpamAssassin for quite a while, but over time it seemed to become less and less effective at filtering spam. I wondered whether the spammers were simply outsmarting it (I saw a lot of scores just under 5, the default spam threshold), and much of what got through was foreign-language spam in different character sets. Then I started using sa-learn, which is a way of telling SpamAssassin which email you consider spam and which you consider ham (legitimate mail). To be effective, you need to have built up quite a collection of each type.

These commands can be run on your existing mailbox(es) to teach spamassassin how to separate the ham from the spam.

Examples of Bayesian training

These assume an mbox format.

 sa-learn --spam --no-sync --showdots --local --mbox ~mark/imap/SpamTrap
 sa-learn --spam --no-sync --showdots --local --mbox ~mark/imap/SpamActual
 sa-learn --ham --no-sync --showdots --local --mbox ~mark/imap/2005
 sa-learn --ham --no-sync --showdots --local --mbox ~mark/imap/2004
 sa-learn --ham --no-sync --showdots --local --mbox ~mark/imap/2003
 sa-learn --ham --no-sync --showdots --local --mbox ~mark/imap/2002
 sa-learn --sync

Use man sa-learn to find out more.
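The per-mailbox commands above can be wrapped in a small loop. This is only a sketch: the function name learn_mboxes and the SA_LEARN variable are made up for illustration and are not part of SpamAssassin. It uses the same sa-learn options as above and skips any mailbox it cannot read.

```shell
#!/bin/sh
# Sketch: train from several mbox files, deferring the sync until the end.
# SA_LEARN and learn_mboxes are illustrative names, not SpamAssassin features.
SA_LEARN=${SA_LEARN:-sa-learn}

# learn_mboxes KIND FILE...   (KIND is "spam" or "ham")
learn_mboxes() {
    kind=$1
    shift
    for mbox in "$@"; do
        if [ ! -r "$mbox" ]; then
            echo "skipping unreadable mbox: $mbox" >&2
            continue
        fi
        # Same options as the commands above, minus the progress dots.
        "$SA_LEARN" "--$kind" --no-sync --local --mbox "$mbox" || return 1
    done
}

# Example usage with the mailboxes from this page:
# learn_mboxes spam ~mark/imap/SpamTrap ~mark/imap/SpamActual
# learn_mboxes ham  ~mark/imap/2005 ~mark/imap/2004 ~mark/imap/2003 ~mark/imap/2002
# "$SA_LEARN" --sync
```

Running the single --sync at the end matches the pattern above, where --no-sync is used for every training pass.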

At first, doing this didn't help my spam problem because, as it turned out, spamd was running as a different user than me (mark). By default sa-learn keeps the Bayes database per user, so what I had trained under my own account was never consulted. Amavisd runs SpamAssassin, so I used pstree -aup to find out which user ID was running it. It was amavis, so I ran the same sa-learn commands again, but this time as amavis.
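As an alternative to reading pstree output by eye, the owning user can be picked out of a process listing. This is a rough sketch; the function name daemon_users is made up, and it assumes the daemons show up with command names beginning with spamd or amavisd, which may differ on your system.

```shell
#!/bin/sh
# daemon_users: read "user command" pairs on stdin and print the distinct
# users that own a process whose command name starts with spamd or amavisd.
# (Illustrative helper; the command names are an assumption about the setup.)
daemon_users() {
    awk '$2 ~ /^(spamd|amavisd)/ { print $1 }' | sort -u
}

# Example usage:
# ps -e -o user= -o comm= | daemon_users
```

Whatever user this prints is the one whose Bayes database needs the training.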

Here are the message counts for each mailbox, which I believe put me in the sweet spot for Bayesian effectiveness (based on the sa-learn man page).

  • SpamTrap - 2273 message(s)
  • SpamActual - 15 message(s)
  • 2005 - 436 message(s)
  • 2004 - 1368 message(s)
  • 2003 - 2286 message(s)
  • 2002 - 711 message(s)
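If you want a rough count before training, an mbox can be counted by its "From " separator lines. This is a sketch with a made-up function name (count_messages); it is approximate, since it relies on body lines that begin with "From " having been escaped by whatever wrote the mailbox.

```shell
#!/bin/sh
# count_messages FILE: print the number of "From " separator lines in an
# mbox file, which approximates the number of messages it holds.
count_messages() {
    grep -c '^From ' "$1"
}

# Example usage:
# for f in ~mark/imap/SpamTrap ~mark/imap/2005; do
#     printf '%s - %s message(s)\n' "$(basename "$f")" "$(count_messages "$f")"
# done
```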

Running sa-learn --sync produced this output...

 expired old Bayes database entries in 82 seconds
 126481 entries kept, 81987 deleted
 token frequency: 1-occurence tokens: 55.12%
 token frequency: less than 8 occurrences: 31.14%
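To confirm that the training actually landed in the database, sa-learn --dump magic prints the database's bookkeeping values, including the number of spam and ham messages learned (nspam and nham). The helper below is a sketch with a made-up name (magic_value); it assumes the usual layout where each line ends in "non-token data: NAME" and the value sits in the third column. With default settings, SpamAssassin does not apply Bayes scoring until at least 200 spam and 200 ham have been learned.

```shell
#!/bin/sh
# magic_value NAME: read `sa-learn --dump magic` output on stdin and print
# the value for the entry whose line ends in "non-token data: NAME".
# Assumes the value is in the third whitespace-separated column.
magic_value() {
    awk -v want="non-token data: $1" '
        substr($0, length($0) - length(want) + 1) == want { print $3 }'
}

# Example usage (run as the user that owns the Bayes database):
# sa-learn --dump magic | magic_value nspam
# sa-learn --dump magic | magic_value nham
```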

Update 2005-Dec-30

Based on what I have seen in /etc/cron.daily/amavisd-new, it appears the Bayesian database may need to be built and owned by the amavis user. So the commands to use when running spamc/spamd in conjunction with amavisd would seem to be...

 su amavis -c "sa-learn --spam --no-sync --progress --local --mbox /tmp/Spam*"
 su amavis -c "sa-learn --ham --no-sync --progress --local --mbox /tmp/200[2345]"
 su amavis -c "sa-learn --sync"

Note that I had to copy my personal mboxes into /tmp and widen the perms for amavis to read them. *Sigh*
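A slightly tidier way to do the copy is to stage the mailboxes in a dedicated directory that only the owner and one group can read, rather than widening permissions in /tmp. This is a sketch only: the function name stage_mboxes and the /var/tmp/sa-train path are made up, and the group name amavis is an assumption about the local setup.

```shell
#!/bin/sh
# stage_mboxes DIR FILE...: copy mbox files into DIR with owner/group-only
# permissions (directory 750, files 640). Illustrative helper.
stage_mboxes() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    chmod 750 "$dest" || return 1
    for src in "$@"; do
        cp "$src" "$dest/" || return 1
        chmod 640 "$dest/$(basename "$src")" || return 1
    done
}

# Example usage, as root (the group name is an assumption):
# stage_mboxes /var/tmp/sa-train ~mark/imap/SpamTrap ~mark/imap/2005
# chgrp -R amavis /var/tmp/sa-train
# su amavis -c "sa-learn --spam --no-sync --local --mbox /var/tmp/sa-train/SpamTrap"
```

Remove the staging directory once training is done so copies of personal mail are not left lying around.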

Update 2006-Jan-19

Now I am getting a strange error.

 su amavis -c "sa-learn --ham --no-sync --showdots --local --mbox /tmp/2006"
 bayes: bayes db version 0 is not able to be used, aborting! 
  at /usr/share/perl5/Mail/SpamAssassin/BayesStore/ line 160.

This has happened twice now. The problem seems to go away if I keep retrying the command, alternating it with:

 sa-learn -D --sync

This is still a mystery though.

Update 2006-Dec-20

More and more, image spam is leaking through. The spammers have figured out how to put their spam message into an embedded image. The text in the image is readable, but blocky and runs together. There is also a significant amount of noise and/or anomalies in the images to defeat OCR programs. Finally, these types of spam almost always contain a bunch of seemingly random text designed to defeat Bayesian filters.

I have looked into gocr and tested it on a few of the images coming through, and my conclusion is that the spammers are winning this particular battle. There is the FuzzyOCR plugin for SpamAssassin, but I have not tried it yet. I hope to soon.
