OSBF-Lua

Text classification module for the Lua Programming Language

(and a complete anti-spam filter in Lua using the module)




Overview

OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a C module for text classification written for Lua. It is a port of the OSBF classifier implemented in CRM114, http://crm114.sf.net, and borrows many good ideas from Bill Yerazunis' CRM114, such as the basic database structure and the Bayesian chain implementation. The OSBF algorithm is a typical Bayesian classifier enhanced with the OSB (Orthogonal Sparse Bigrams) feature extraction technique and an ad hoc Confidence Factor (aka “voodoo”) that automatically reduces the impact of less significant features on the classification, a form of noise reduction. The result is a very fast and accurate classifier. It was developed with two classes in mind, SPAM and NON-SPAM, so performance with more than two classes may not be the same.
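As a rough illustration of the OSB feature extraction mentioned above, the following Python sketch pairs each token with the tokens that precede it inside a sliding window and records how many tokens were skipped in between. The function name osb_features, the window size of 5, and the (left, skipped, right) tuple representation are assumptions made for this sketch only; it is not the module's C code and does not show the Bayesian scoring or the Confidence Factor.

```python
def osb_features(tokens, window=5):
    """Return sparse bigrams as (left_token, skipped_count, right_token).

    Each token is paired with every earlier token that lies within
    `window` positions of it. `skipped_count` is the number of tokens
    between the two, so pairs at different distances stay distinct.
    """
    features = []
    for i, right in enumerate(tokens):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break  # no more tokens to the left
            features.append((tokens[j], dist - 1, right))
    return features


if __name__ == "__main__":
    # Hypothetical input, already split into tokens.
    for feature in osb_features(["buy", "cheap", "pills", "online", "now"]):
        print(feature)
```

With a window of 5, the last token of a five-token text is paired with each of the four tokens before it, so a short phrase already yields several features that tolerate intervening words.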

I originally developed both OSB and OSBF for the CRM114 project, but the OSB technique was officially announced in the paper “Combining Winnow with Orthogonal Sparse Bigrams for Incremental Spam Filtering”, a work headed and presented by Christian Siefkes at the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) in September 2004.

This implementation puts the focus on the classification task itself by using Lua, a powerful yet lightweight and fast language, which makes it easier to build and test more elaborate filters and training methods.

OSBF-Lua is free software released under the GPL version 2. You can get a copy of the license at GPL. This distribution includes a copy of the license in the file gpl.txt.

Download

What's new

Installation

OSBF-Lua requires Lua 5.1 installed with dynamic loading enabled. It was developed and tested with the Lua 5.1 work, alpha, beta, and final releases, and probably won't work with earlier versions.

Installation steps:




You must be root to run the “make install” step. If you don't have root access, copy the newly built libosbf.so-x.y.z to a directory you can write to, under the name “osbf.so”, and add that directory to LUA_CPATH. For example:


mkdir $HOME/lib

cp libosbf.so-x.y.z $HOME/lib/osbf.so

# Lua 5.1 search paths are ";"-separated templates; ";;" appends the default path
export LUA_CPATH="$HOME/lib/?.so;;"


After the osbf module is properly installed, you may want to install the spamfilter, a Lua script that uses the OSBF-Lua module to classify and tag messages as spam or non-spam (ham) according to the score they get or to the whitelists and blacklists, if any:


make install_spamfilter


The spamfilter files are installed in /usr/local/osbf-lua. If the directory doesn't exist, it will be created.


The next step is to configure your email account to use the spamfilter, for example with the following procmail recipe:


# set OSBF_LUA_DIR to where spamfilter.lua, spamfilter_command.lua etc were installed

OSBF_LUA_DIR=/usr/local/osbf-lua

OSBF_LUA_USER_DIR=$HOME/osbf-lua

# let the Lua interpreter find the “osbf” module.

# uncomment if you installed a local copy of the osbf module (e.g. no root access)

#LUA_CPATH="$HOME/lib/?.so;;"


:0fw: .msgid.lock

* < 350000 # don't check messages greater than 350000 bytes

| $OSBF_LUA_DIR/spamfilter.lua --gdir $OSBF_LUA_DIR --udir $OSBF_LUA_USER_DIR


Check your installation by sending yourself a message with the following command in the subject line:


help <your password>


You should receive a message with help on the spamfilter. Then send another command in the subject line to verify that the databases were created correctly:


stats <your password>


You should get a statistics report on the newly created databases.


From now on, every message you receive that is smaller than the maximum size set in the procmail recipe will be classified and tagged according to its score:


Tag     Meaning
[--]    almost certainly spam: score <= -20
[-]     probably spam (reinforcement zone): -20 < score < 0
[+]     probably not spam (reinforcement zone): 0 <= score < 20
[++]    almost certainly not spam: score >= 20. This tag exists only for symmetry and is not used; an empty tag is used instead so as not to clutter the messages.
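The mapping from score to tag can be sketched as a small function. This is an illustrative Python sketch based only on the thresholds listed above; the function name tag_for_score and its threshold parameter are assumptions, and the actual logic lives in the spamfilter's Lua scripts.

```python
def tag_for_score(score, threshold=20):
    """Map a classification score to the tag described in the table.

    Negative scores lean towards spam, non-negative scores towards
    non-spam. `threshold` is the edge of the reinforcement zone.
    """
    if score <= -threshold:
        return "[--]"  # almost certainly spam
    if score < 0:
        return "[-]"   # probably spam, inside the reinforcement zone
    if score < threshold:
        return "[+]"   # probably not spam, inside the reinforcement zone
    return ""          # "[++]" case: an empty tag is used instead


if __name__ == "__main__":
    for s in (-35.2, -3.0, 0, 7.5, 42):
        print(s, repr(tag_for_score(s)))
```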


If the classification is wrong, you must train the filter by replying to the message (it must be a “Reply”, not a “Forward”), sending it back to yourself with the subject replaced by the corresponding training command:


learn <password> spam

or

learn <password> nonspam


The body of the message is not required and may be erased. The original message, temporarily saved on the server by the spamfilter, is recovered through the SFID (Spam Filter ID), a special mark added to the header when the message arrived at the server.


If you make a mistake, undo the training with the “unlearn” command. For example, if a message was wrongly trained as spam:


unlearn <password> spam


Training when the classification is wrong is essential for accuracy. Training on messages that fall in the reinforcement zone, called reinforcement, is highly recommended for raising accuracy and keeping it high. Once you have a well-trained filter, say 99% accuracy or better, you may want to narrow the reinforcement zone, e.g. to [-10, 10], so that you don't have to do many reinforcements a day. You can change the reinforcement zone, tags, etc. by editing the spamfilter_config.lua file.
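The training policy described above (train on errors, reinforce inside the zone) can be sketched as a decision function. This Python sketch is for illustration only: the function name training_subject and its parameters are hypothetical, and it assumes that reinforcement uses the same “learn” command as error correction, which the text above suggests but does not state outright.

```python
def training_subject(score, is_spam, password, zone=20):
    """Return the subject-line command to send, or None if no training is needed.

    score    -- score assigned by the filter (negative leans spam)
    is_spam  -- the true class of the message, as judged by the user
    password -- placeholder for the user's spamfilter password
    zone     -- half-width of the reinforcement zone (20 by default here)
    """
    label = "spam" if is_spam else "nonspam"
    misclassified = (score < 0) != is_spam
    in_reinforcement_zone = -zone < score < zone
    if misclassified or in_reinforcement_zone:
        return "learn %s %s" % (password, label)
    return None


if __name__ == "__main__":
    # A spam message that scored 35 was misclassified, so it must be trained.
    print(training_subject(35, True, "<password>"))
    # A spam message that scored -40 is confidently correct: nothing to do.
    print(training_subject(-40, True, "<password>"))
```

Narrowing the zone, for instance with zone=10, reduces how often a correctly classified message still triggers a reinforcement.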

Credits

The OSBF-Lua lib and spamfilter.lua were designed and implemented by Fidelis Assis, who holds the copyright.

The OSB technique and the OSBF classify and learn code are based on the OSB and OSBF I originally developed for the CRM114 project (http://crm114.sf.net), as a derivative work based on Bill Yerazunis' CRM114 Markovian classifier.

Contact

For more information please email me. Comments are welcome!