OSBF-Lua


Text classification module for the Lua Programming Language
and a production-class anti-spam filter written in Lua using the module

Winner of TREC's Spam Track 2006





Overview

OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a Lua C module for text classification. It is a port of the OSBF classifier implemented in the CRM114 project. By using Lua, a powerful yet lightweight and fast scripting language, this implementation keeps the focus on the classification task itself and makes it easier to build and test more elaborate filters and training methods.

The OSBF algorithm is a typical Bayesian classifier enhanced with two techniques that I originally developed for the CRM114 project: Orthogonal Sparse Bigrams (OSB), for feature extraction, and Exponential Differential Document Count (EDDC, a.k.a. Confidence Factor), for automatic feature selection. Combined, these two techniques produce a highly accurate classifier. OSBF was developed with two classes in mind, SPAM and NON-SPAM, so performance with more than two classes may not be the same.
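To make the feature-extraction idea concrete, here is a minimal sketch of OSB in Lua. It is not taken from the OSBF-Lua source: the function name `osb_features`, the whitespace tokenization, the window of 5 tokens and the "<skip>" gap marker are all assumptions made for illustration. The C module most likely works on hashed features rather than strings.

```lua
-- Sketch of Orthogonal Sparse Bigram (OSB) feature extraction.
-- Assumptions (illustrative only): whitespace tokenization, a window of
-- 5 tokens, and "<skip>" as the marker for each skipped position.
local function osb_features(text, window)
  window = window or 5
  local tokens = {}
  for tok in text:gmatch("%S+") do
    tokens[#tokens + 1] = tok
  end

  local features = {}
  for i = 2, #tokens do
    -- Pair the current token with each of the (up to) window-1 tokens
    -- that precede it, recording the distance with skip markers.
    for dist = 1, math.min(window - 1, i - 1) do
      local gap = string.rep("<skip> ", dist - 1)
      features[#features + 1] = tokens[i - dist] .. " " .. gap .. tokens[i]
    end
  end
  return features
end

for _, f in ipairs(osb_features("buy cheap pills online now")) do
  print(f)
end
```

Each token is paired only with earlier tokens inside the window, so the number of features grows linearly with the text length instead of covering every possible word combination.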

spamfilter.lua is an anti-spam filter written in Lua using the OSBF-Lua module. It takes special advantage of EDDC to introduce TONE-HR, a highly effective training method. The combination of OSB, EDDC and TONE-HR to enhance a classical Bayesian classifier resulted in the best spam filtering performance in TREC's Spam Track 2006.

The Confidence Factor was officially introduced in the paper "Exponential Differential Document Count - A Feature Selection Factor for Improving Bayesian Filters Accuracy", presented at the MIT Spam Conference 2006, after more than a year of experimental use in both projects, CRM114 and OSBF-Lua. The conference slides are also available.

The OSB technique was officially announced in the paper "Combining Winnow with Orthogonal Sparse Bigrams for Incremental Spam Filtering", a work headed and presented by Christian Siefkes at the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) in September 2004.

The CRM114 implementation of OSBF was one of the classifiers submitted to TREC's Spam Track 2005 by the CRM114 team, but its first results were not good because of a bug. Later, the bug was fixed and the OSBF-Lua version was submitted to the track coordinator, Prof. Gordon Cormack, for an extra evaluation. The new results were comparable to those of the best participants, with the advantage of being 5 to 10 times faster. Our notebook paper comments on the results of the four filters submitted by the CRM114 team: OSBF, Winnow, OSB Unique and OSB.

OSBF-Lua is free software released under the GPL version 2. A copy of the license is included in this distribution in the file gpl.txt.

Installation

OSBF-Lua requires Lua 5.1 installed with dynamic loading enabled. It was developed and tested with the work, alpha, beta and final releases of Lua 5.1. It probably won't work with earlier versions.

Installation steps:

To install in the default directory you must be root for the "make install" step. If you don't have root access, set PREFIX to a directory you can write to, for instance $HOME/lib, and add that directory to LUA_CPATH so that the Lua loader can find osbf.so.

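Example: installing in $HOME/lib

<edit config and set PREFIX to $HOME/lib>
$ mkdir $HOME/lib
$ make install
$ export LUA_CPATH="$HOME/lib/?.so;$LUA_CPATH"

Entries in LUA_CPATH are separated by semicolons, so the value must be quoted to keep the shell from treating the semicolon as a command separator.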
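The search path can also be extended from inside a script through the standard `package.cpath` variable, which is handy for checking a local installation. The sketch below assumes osbf.so was installed in $HOME/lib as in the example; it only tries to load the module and reports the outcome, without calling any OSBF-Lua function.

```lua
-- Prepend the local install directory to the C-module search path.
-- Assumption: osbf.so lives in $HOME/lib, as in the example above.
local home = os.getenv("HOME") or "."
local pattern = home .. "/lib/?.so"
package.cpath = pattern .. ";" .. package.cpath

-- pcall keeps the script running if the module cannot be found.
local ok, result = pcall(require, "osbf")
if ok then
  print("osbf module loaded")
else
  print("could not load osbf: " .. tostring(result))
end
```

If the module is not found, the error message lists every file name the loader tried, which makes it easy to spot a wrong directory or separator.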

After the osbf module is properly installed, you may want to install the spamfilter, a Lua script that uses the OSBF-Lua module to classify and tag messages as spam or non-spam (ham) according to the score they get, or according to the white/blacklists, if any:

 make install_spamfilter

The spamfilter files are installed in /usr/local/osbf-lua. If the directory doesn't exist, it will be created.

The next step is to configure your email account to use the spamfilter, for example with a procmail recipe like this one:

# set OSBF_LUA_DIR to where spamfilter.lua, spamfilter_command.lua etc. were installed
OSBF_LUA_DIR=/usr/local/osbf-lua # change '/usr/local' to your PREFIX
OSBF_LUA_USER_DIR=$HOME/osbf-lua

# let the Lua interpreter find the "osbf" module.
# uncomment if you installed a local copy of the osbf module (e.g. no root access)
#LUA_CPATH="$LUA_CPATH;$HOME/lib/?.so"

# don't check messages greater than 350000 bytes
:0fw: .msgid.lock
* < 350000
| $OSBF_LUA_DIR/spamfilter.lua --udir $OSBF_LUA_USER_DIR

Check your installation by sending a message to yourself with the following command in the subject line:

help <your password>

You should receive a message with help on the spamfilter. Then send another command in the subject line to verify that the databases were created correctly:

stats <your password>

You should get a statistics report on the newly created databases.

From now on, all messages you receive that are smaller than the maximum size specified in the procmail recipe will be classified and tagged according to the score they get:


Tag     Meaning
[--]    almost certainly spam: score <= -20
[-]     probably spam (reinforcement zone): -20 < score < 0
[+]     probably not spam (reinforcement zone): 0 <= score < 20
[++]    almost certainly not spam: score >= 20

The [++] tag is listed only for symmetry and is not used; an empty tag is used in its place so as not to clutter the messages.
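The thresholds above translate directly into a small lookup. The following is only a sketch: `tag_for_score` and its `threshold` parameter are hypothetical names, and spamfilter.lua may organize this logic differently.

```lua
-- Map a classification score to the subject tag listed above.
-- Hypothetical helper; the default threshold of 20 comes from the tag list.
local function tag_for_score(score, threshold)
  threshold = threshold or 20
  if score <= -threshold then
    return "[--]"
  elseif score < 0 then
    return "[-]"
  elseif score < threshold then
    return "[+]"
  else
    return ""   -- [++] is not used; an empty tag is emitted instead
  end
end

for _, s in ipairs({-35, -5, 0, 12, 20}) do
  print(s, "'" .. tag_for_score(s) .. "'")
end
```

Passing a smaller threshold, such as 10, narrows the two middle bands without touching the rest of the logic.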

If the classification is wrong, you must train the filter by replying to the message (use "Reply", not "Forward"), sending it back to yourself with the subject replaced by the corresponding training command:

learn <password> spam

or

learn <password> nonspam

The body of the message may be erased; it is not required. The original message, temporarily saved on the server by the spamfilter, is recovered through the SFID (Spam Filter ID), a special mark added to the header when the message arrived at the server.

If you make a mistake, undo the training with the "unlearn" command. For example, if a message was wrongly trained as spam:

unlearn <password> spam
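The subject-line commands shown so far (help, stats, learn, unlearn) all share the shape "command password [class]". As an illustration of that shape only, here is a hypothetical parser; `parse_command` is not part of OSBF-Lua, and the real spamfilter may accept other commands or a different syntax.

```lua
-- Parse a subject-line command such as "learn secret spam".
-- Hypothetical helper; spamfilter.lua's own parsing may differ.
local function parse_command(subject)
  local cmd, password, arg =
    subject:match("^%s*(%a+)%s+(%S+)%s*(%a*)%s*$")
  if not cmd then return nil end
  cmd = cmd:lower()
  if cmd == "learn" or cmd == "unlearn" then
    -- Training commands need a class: spam or nonspam.
    if arg ~= "spam" and arg ~= "nonspam" then return nil end
    return { command = cmd, password = password, class = arg }
  elseif cmd == "help" or cmd == "stats" then
    -- Informational commands take no extra argument.
    if arg ~= "" then return nil end
    return { command = cmd, password = password }
  end
  return nil
end

local c = parse_command("unlearn secret spam")
print(c.command, c.class)
```

Returning nil for anything that does not match lets ordinary subjects pass through untouched.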

Training when the classification is wrong is essential for accuracy. Training when the score falls in the reinforcement zone, called reinforcement, is highly recommended for increasing accuracy and keeping it high. Once you have a well-trained filter, say 99% accuracy or better, you may want to narrow the reinforcement zone, e.g. to [-10, 10], so that you don't have to do many reinforcements a day. You can change the reinforcement zone, tags, etc. by editing the file spamfilter_config.lua.
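The training rule described above (always train on errors, and also train on correct classifications whose score lies inside the reinforcement zone) can be sketched as follows. The function `should_train` and its parameters are hypothetical and only illustrate the decision; they are not the spamfilter's actual interface.

```lua
-- Decide whether a message should be sent back for training.
-- Hypothetical helper: `score` is the filter's score, `is_spam` is the
-- user's verdict, `zone` is the half-width of the reinforcement zone.
local function should_train(score, is_spam, zone)
  zone = zone or 20
  local classified_as_spam = score < 0
  if classified_as_spam ~= is_spam then
    return true, "error"           -- wrong classification: always train
  elseif math.abs(score) < zone then
    return true, "reinforcement"   -- correct, but not confident enough
  end
  return false, "none"
end

print(should_train(-5, true))       -- correct but inside the default zone
print(should_train(-15, true, 10))  -- correct and outside a narrower zone
```

Narrowing `zone` from 20 to 10 reduces how often a correctly classified message still asks for training, which is the trade-off the paragraph above describes.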

Credits

The OSBF-Lua lib and spamfilter.lua were designed and implemented by Fidelis Assis, who holds the primary copyright.

The OSB technique, as well as the OSBF classify and learn code, is based on the OSB and OSBF I originally developed for the CRM114 project, as a derivative work based on Bill Yerazunis' CRM114 Markovian classifier. Bill Yerazunis holds the secondary copyright on the OSBF-Lua lib.

Contact

For more information please email me. Comments are welcome!