Chip's Tips for Developers

Contains coding, but not narcotic.

To spam or not to spam; I have an answer (maybe)

July 19th, 2010 4:15:48 pm pst by Sterling Camden

Overall I’ve been very happy with my getlessmail spam filter. Using a simple Ruby script to describe the rules for weeding out the canned content posing as real meat provides enough flexibility without complexity most of the time. However, spammers are a smart lot (if you overlook the questionable decision to get into the spamming game in the first place). They take great pains to defy simple rules for identifying them as spammers. I needed something additional.

A few years ago, I read Paul Graham’s essay A Plan for Spam and its sequel Better Bayesian Filtering. The simple yet convincing logic of his approach fascinated me, and I itched to try implementing it. Now that I have a reason to do so, I’ve written a Bayesian filter in Ruby. I implemented it as a class (IsSpam), with a command-line utility wrapper (isspam). The former can be easily built into a getlessmail script without spawning another process (not that there’s anything wrong with that), while the latter can be used from an MUA or cron job to populate the database.
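For readers who haven't seen the essay, here is a minimal sketch of the two calculations at the heart of A Plan for Spam: a per-token spam probability, and the combination of the most "interesting" token probabilities into one score. The method names and argument layout are illustrative assumptions, not the interface of the IsSpam class, and the sketch assumes both corpora are non-empty.

```ruby
# Sketch of the scoring described in "A Plan for Spam".
# Method names are hypothetical; this is not the IsSpam implementation.

# good/bad: how often the token appears in legitimate/spam mail.
# ngood/nbad: number of legitimate/spam messages seen (assumed non-zero).
def token_probability(good, bad, ngood, nbad)
  g = 2 * good                      # legitimate counts are doubled to bias against false positives
  return nil if g + bad < 5         # too rare to say anything useful
  b_ratio = [1.0, bad.to_f / nbad].min
  g_ratio = [1.0, g.to_f / ngood].min
  # Clamp so that no single token is ever treated as absolute proof.
  [0.01, [0.99, b_ratio / (g_ratio + b_ratio)].min].max
end

# Combine the probabilities farthest from the neutral 0.5.
def spam_probability(probs, limit = 15)
  interesting = probs.sort_by { |p| -(p - 0.5).abs }.first(limit)
  prod = interesting.inject(1.0) { |acc, p| acc * p }
  inv  = interesting.inject(1.0) { |acc, p| acc * (1.0 - p) }
  prod / (prod + inv)
end
```

In the essay, a token with no usable history is assigned 0.4, and a message counts as spam when the combined score exceeds 0.9.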

You can grab the tarball at the link below, or clone the Mercurial repository from BitBucket. See the README for an overview. I’ve also included a man page for the command-line utility, and RDoc pages for the Ruby class.

I’ve followed Graham pretty closely in translating the algorithms from Lisp to Ruby. Some exceptions include how I handle non-word characters and how I test for phrases.

For non-word characters, I do two things: I test the words both without them and with them. The exception is the punctuation characters [.:;,], which I don’t include in a word if they are followed by whitespace (this pattern can be overridden by the ‘word_split’ attribute). Additionally, if a word ends in [?!] (overridable via the ‘trailing’ attribute), then I also test the same word without that character, recursively. Thus, “buy!!!” tests as “buy!!!”, “buy!!”, “buy!”, and “buy”.
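The trailing-character rule can be sketched roughly like this. The helper name trailing_variants and its default pattern are assumptions made for illustration; the real class drives this through its ‘trailing’ attribute.

```ruby
# Hypothetical helper: returns the word itself plus every variant obtained
# by stripping one more trailing [?!] character.
def trailing_variants(word, trailing = /[?!]\z/)
  variants = [word]
  loop do
    shorter = word.sub(trailing, '')
    # Stop when nothing was stripped, or when only punctuation remained.
    break if shorter == word || shorter.empty?
    variants << shorter
    word = shorter
  end
  variants
end

p trailing_variants('buy!!!')
```

Each variant is then looked up as its own token, so a word that only ever appears with emphasis in spam can score differently from its plain form.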

In addition to testing each word individually, I test the combination of adjacent words up to the value of the ‘max_phrase_length’ attribute (3 by default, but it can be overridden). Certain combinations of words should have their own score, but if you test for too long a combination, then you penalize performance for little or no gain. In any case, I don’t test any phrase longer than 256 characters.
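As a rough sketch of that phrase generation (the method name phrases and the choice to join words with a single space are assumptions, not taken from the IsSpam source):

```ruby
# Hypothetical helper: builds every run of 1..max_phrase_length adjacent
# words, skipping any phrase longer than max_chars characters.
def phrases(words, max_phrase_length = 3, max_chars = 256)
  result = []
  (1..max_phrase_length).each do |n|
    words.each_cons(n) do |group|
      phrase = group.join(' ')
      result << phrase if phrase.length <= max_chars
    end
  end
  result
end

p phrases(%w[buy cheap pills now])
```

The number of tokens grows roughly linearly with max_phrase_length, which is why a small default keeps the database and the scoring time in check.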

So far, the results look promising. My only problem is that I don’t yet have enough data. For the first time in my life, I find myself looking forward to receiving more spam, so I can collect a better sample size.

download

Posted in Ruby, Unix | 27 Comments

GetLessMail gets more info

July 5th, 2010 2:43:35 pm pst by Sterling Camden

In one of my curious moods, I began to wonder how difficult it would be to figure out the location of an email sender based on the IP address shown in the “Received” header fields. It turns out to be more difficult than you may have thought, because:

  1. An email often contains multiple “Received” headers, one for each relay point. The innermost (last) is the original sender.
  2. However, the original transmission is often within a local network, so the first one or few IPs may be in the reserved local range.
  3. No free, global, authoritative database exists that contains the location of all IPs. At least, not that I’ve found. However, there are some free databases you can download that are updated from time to time.
  4. The owner of the IP address may not be located at the same place as the connection. In fact, it usually isn’t, but it may be close.

Despite these impediments, I have implemented IP Geolocation for Ruby, and created a method specialized for GetLessMail that uses it.
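To illustrate the first two points, here is a minimal sketch of picking the first publicly routable IPv4 address, starting from the innermost “Received” header. It is not the code in IPGeoMail.rb: the names RESERVED, reserved?, and first_public_ip are made up for this example, it ignores IPv6, and the list of reserved ranges is deliberately short.

```ruby
require 'ipaddr'

# A few reserved IPv4 ranges (private, loopback, link-local); not exhaustive.
RESERVED = %w[10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
              127.0.0.0/8 169.254.0.0/16].map { |cidr| IPAddr.new(cidr) }

def reserved?(ip)
  RESERVED.any? { |range| range.include?(ip) }
end

# received_headers: the Received header values in the order they appear
# in the message, so the last element is the innermost one.
def first_public_ip(received_headers)
  received_headers.reverse_each do |header|
    header.scan(/\b(?:\d{1,3}\.){3}\d{1,3}\b/).each do |text|
      begin
        ip = IPAddr.new(text)
      rescue ArgumentError
        next                      # looked like an IP but wasn't (e.g. 999.1.1.1)
      end
      return text unless reserved?(ip)
    end
  end
  nil                             # nothing usable found
end
```

A caller would then pass the returned address to whatever lookup the geolocation database supports.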

The two scripts IPGeo.rb and IPGeoMail.rb should be placed somewhere in your Ruby require path. The example database, which I downloaded from http://linuxbox.co.uk/ip-address-whois-database.php, should be placed in /usr/local/share/IPGeo (or you can modify the script to access it wherever you choose). The included dot.getlessmail shows how you could use it to add an “X-IP-Location” header that provides the IP Location data, if found.

As I intimated, you could also use IPGeo.rb outside of the context of email. It would be trivial to write a script that accepts an IP Address and prints out the information. Like so:


require 'IPGeo'
$<.each do |line|
  puts IPGeo.locate IPGeo.get_ip(line)
end

Of course, this information is only as good as your database. The one I've included hasn't been updated since August 2009. You can probably find better databases out there, if you're willing to spend some money on them. I'm not.

You can get the updated tarball using the button below, or scrape it out of the BitBucket.

download

Posted in Ruby, Unix | 4 Comments

Script email filtering with Ruby

April 22nd, 2010 5:32:49 pm pst by Sterling Camden

I’ve used all sorts of email filters since my very first internet email account in the early 90s – and none of them have been quite right. I’d like to be able to block anything about Viagra, but not when a friend or family member uses the word. Pure Bayesian filters always seem to block something from someone I know, while letting a few of the real spam messages through. But whitelists and blacklists suffer from a “which rule comes first” problem.

I recently moved to FreeBSD as my primary workstation OS, and I’m now reading my email with mutt, after delivery by getmail.  Getmail has a pretty easy configuration for inline filters, so I decided to create a rules engine for filtering messages the way I want to.  I decided to write it in Ruby, which naturally led to the creation of a simple EDSL in Ruby for manipulating email content and approving or rejecting an incoming message.  Since it’s intended for use with getmail, I decided to call it “getlessmail”.

By connecting the getlessmail.rb script (which you can download below) into getmail as an external filter, you can write a user-specific script in Ruby to specify your filtering rules, like so:

keep if from 'mybestfriend@example.com'
spam if from '@example.com'
spam if subject 'viagra|cialis'
spam if body '(?m:\bnude\b.*\bpics\b)'

With this ordering, mybestfriend@example.com is automatically approved, while anybody else from that domain is considered spam.  Likewise, mybestfriend can use viagra or cialis in the subject line, or “nude” followed by “pics” in the body, and it will still be approved – but not if from anyone else.

As you can see, the patterns are regular expression fragments.  These get sewn into larger expressions that isolate their intended context.  By default, they’re treated as case-insensitive and not multi-line – but you can turn any options on or off using the contextual options grouping supported by Ruby regexen (as I have with “(?m:)” in the last example entry above).  Patterns are always automatically parenthesized to avoid issues with operator precedence, so don’t add enclosing parentheses of your own unless you need them for other reasons.
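For the curious, here is one way an EDSL of this shape could be wired up in plain Ruby. This is a simplified sketch, not the getlessmail source: the class name RuleSketch, the use of catch/throw to stop at the first matching rule, and the default verdict of :keep are all assumptions, and folded (multi-line) headers are ignored.

```ruby
# Sketch only: RuleSketch and its internals are hypothetical.
class RuleSketch
  def initialize(message)
    # Headers and body are separated by the first blank line.
    @headers, @body = message.split(/\r?\n\r?\n/, 2)
    @headers ||= ''
    @body ||= ''
  end

  # Runs the rules text; the first keep/spam that fires ends evaluation.
  def evaluate(rules)
    catch(:verdict) do
      instance_eval(rules)
      :keep                       # assumed default when no rule fires
    end
  end

  def keep
    throw :verdict, :keep
  end

  def spam
    throw :verdict, :spam
  end

  def from(pattern)
    header?('From', pattern)
  end

  def subject(pattern)
    header?('Subject', pattern)
  end

  def body(pattern)
    !!(@body =~ Regexp.new("(#{pattern})", Regexp::IGNORECASE))
  end

  private

  # The fragment is parenthesized so alternation such as a|b stays
  # confined to the named header line.
  def header?(name, pattern)
    !!(@headers =~ Regexp.new("^#{name}:.*(#{pattern})", Regexp::IGNORECASE))
  end
end

rules = <<'RULES'
keep if from 'mybestfriend@example.com'
spam if from '@example.com'
spam if subject 'viagra|cialis'
spam if body '(?m:\bnude\b.*\bpics\b)'
RULES

message = "From: stranger@example.com\nSubject: hello\n\nJust saying hi."
puts RuleSketch.new(message).evaluate(rules)
```

Because each rule is an ordinary Ruby statement with a trailing `if`, rule order is simply statement order, which is what sidesteps the “which rule comes first” problem.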

But there’s more.  I’ve included methods for moving messages to folders automatically, and for manipulating message headers.  The folder operations assume that your mailboxes are stored as mbox files, so don’t use them if you’re using maildir format instead.

But that’s not all.  Since your rules script is interpreted as Ruby code, you can go crazy.  Log events, change the contents of the message, translate attachments, write your own Bayesian filter, or anything else you can do with Ruby.

I’ll probably extend the core functions at some point to deal more easily with multi-part messages.  My number one beef is the ms-tnef MIME format, which merely wraps attachments in a Microsoft-specific container.  There’s a tnef utility for unpacking that, so I should be able to strip out attachments in that format, pipe them to tnef, and then sew the resulting files back together in regular MIME multi-part.

See the README file for full documentation.  The download below is in tar.bz2 format, since it’s really only useful on Unix or Linux, where most tar implementations should be able to read it as is.

download

Posted in Ruby, Unix, Wildly popular | 4 Comments

Better Tag Cloud