To spam or not to spam; I have an answer (maybe)
Sterling Camden
Overall I’ve been very happy with my getlessmail spam filter. Using a simple Ruby script to describe the rules for weeding out the canned content posing as real meat provides enough flexibility without complexity most of the time. However, spammers are a smart lot (if you overlook the questionable decision to get into the spamming game in the first place). They take great pains to defy simple rules for identifying them as spammers. I needed something additional.
A few years ago, I read Paul Graham’s essay A Plan for Spam and its sequel Better Bayesian Filtering. The simple yet convincing logic of his approach fascinated me, and I itched to try implementing it. Now that I have a reason to do so, I’ve written a filter in Ruby based on it. I implemented it as a class (IsSpam), with a command-line utility wrapper (isspam). The former can be easily built into a getlessmail script without spawning another process (not that there’s anything wrong with that), while the latter can be used from an MUA or cron job to populate the database.
You can grab the tarball at the link below, or clone the Mercurial repository from BitBucket. See the README for an overview. I’ve also included a man page for the command-line utility, and RDoc pages for the Ruby class.
I’ve followed Graham pretty closely in translating the algorithms from Lisp to Ruby. Some exceptions include how I handle non-word characters and how I test for phrases.
For non-word characters, I do two things: I test each word both with and without them. The exceptions are the punctuation characters [.:;,], which I don’t include in a word when they are followed by whitespace (this pattern can be overridden by the ‘word_split’ attribute). Additionally, if a word ends in [?!] (overridable via the ‘trailing’ attribute), then I also test the same word without that character, recursively. Thus, “buy!!!” tests as “buy!!!”, “buy!!”, “buy!”, and “buy”.
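To make the trailing-punctuation rule concrete, here is a minimal sketch in Ruby. It is not taken from the IsSpam source: the names split_words, trailing_variants, WORD_SPLIT, and TRAILING are hypothetical, and the two patterns are only my guess at what defaults like ‘word_split’ and ‘trailing’ might look like.

```ruby
# Hypothetical patterns, assumed for illustration only.
WORD_SPLIT = /[.:;,]?\s+/   # drop one of . : ; , when whitespace follows it
TRAILING   = /[?!]\z/       # a single trailing ? or !

# Split text into words, leaving punctuation inside a word
# (such as the dot in "example.com") untouched.
def split_words(text, word_split = WORD_SPLIT)
  text.split(word_split).reject(&:empty?)
end

# Return the word followed by each shorter form obtained by
# stripping one trailing ? or ! at a time.
def trailing_variants(word, trailing = TRAILING)
  variants = [word]
  while word.length > 1 && word =~ trailing
    word = word.sub(trailing, '')
    variants << word
  end
  variants
end

p split_words("Buy now, at example.com today")
# => ["Buy", "now", "at", "example.com", "today"]
p trailing_variants("buy!!!")
# => ["buy!!!", "buy!!", "buy!", "buy"]
```

The loop here does the same job as the recursion described above; each variant would then be looked up in the database as its own token.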
In addition to testing each word individually, I test the combination of adjacent words up to the value of the ‘max_phrase_length’ attribute (3 by default, but it can be overridden). Certain combinations of words should have their own score, but if you test for too long a combination, then you penalize performance for little or no gain. In any case, I don’t test any phrase longer than 256 characters.
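The phrase test can be sketched along the same lines. Again, this is an illustration rather than the code in the IsSpam class: the phrases method and the MAX_PHRASE_CHARS constant are hypothetical names, and I am assuming that adjacent words are joined with a single space and that the 256-character cap applies to single words as well.

```ruby
# Hypothetical constant mirroring the 256-character cap mentioned above.
MAX_PHRASE_CHARS = 256

# Return every run of 1..max_phrase_length adjacent words, joined by a
# single space, skipping any run longer than MAX_PHRASE_CHARS.
def phrases(words, max_phrase_length = 3)
  result = []
  words.each_index do |start|
    (1..max_phrase_length).each do |len|
      break if start + len > words.length      # ran off the end of the text
      phrase = words[start, len].join(' ')
      break if phrase.length > MAX_PHRASE_CHARS # longer runs only get longer
      result << phrase
    end
  end
  result
end

p phrases(%w[buy cheap pills now])
# => ["buy", "buy cheap", "buy cheap pills", "cheap", "cheap pills",
#     "cheap pills now", "pills", "pills now", "now"]
```

With n words and a maximum phrase length of k, this produces at most n × k tokens, which is why raising the limit much beyond 3 costs more in lookups than it is likely to return in accuracy.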
So far, the results look promising. My only problem is that I don’t yet have enough data. For the first time in my life, I find myself looking forward to receiving more spam, so I can collect a better sample size.
Posted in Ruby, Unix




