OSBF-Lua
Text classification module for the Lua programming language
Winner of TREC's Spam Track 2006
OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a Lua C module for text classification. It is a port of the OSBF classifier implemented in the CRM114 project. This implementation tries to keep the focus on the classification task itself by using Lua as the scripting language. Lua is powerful yet lightweight and fast, which makes it easier to build and test more elaborate filters and training methods.
The OSBF algorithm is a typical Bayesian classifier enhanced with two techniques that I originally developed for the CRM114 project: Orthogonal Sparse Bigrams (OSB), for feature extraction, and Exponential Differential Document Count (EDDC, a.k.a. Confidence Factor), for automatic feature selection. Combined, these two techniques produce a highly accurate classifier. OSBF was developed with two classes in mind, SPAM and NON-SPAM, so its performance with more than two classes may not be the same.
spamfilter.lua is an anti-spam filter written in Lua using the OSBF-Lua module. It takes special advantage of EDDC to introduce TONE-HR, a highly effective training method. The combination of OSB, EDDC and TONE-HR to enhance a classical Bayesian classifier resulted in the best spam filtering performance in TREC's Spam Track 2006.
The Confidence Factor was officially introduced in the paper "Exponential Differential Document Count - A Feature Selection Factor for Improving Bayesian Filters Accuracy", presented at the MIT Spam Conference 2006, after being in experimental use for more than a year in both projects, CRM114 and OSBF-Lua. The conference slides are also available.
The OSB technique was officially announced in the paper "Combining Winnow with Orthogonal Sparse Bigrams for Incremental Spam Filtering", a work headed and presented by Christian Siefkes at the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), in September 2004.
The CRM114 implementation of OSBF was one of the classifiers submitted to TREC's Spam Track 2005 by the CRM114 team, but its first results were not good because of a bug. Later, the bug was fixed and the OSBF-Lua version was submitted to the track coordinator, Prof. Gordon Cormack, for an extra evaluation. The new results were comparable to those of the best participants, with the advantage of being 5 to 10 times faster. Our notebook paper comments on the results of the four filters submitted by the CRM114 team: OSBF, Winnow, OSB Unique and OSB.
OSBF-Lua is free software and is released under the GPL version 2. You can get a copy of the license at GPL. This distribution includes a copy of the license in the file gpl.txt.
The sources can be downloaded from LuaForge.
See the full CHANGES.
OSBF-Lua requires Lua 5.1 installed with dynamic loading enabled. OSBF-Lua was developed and tested with the work, alpha, beta and final releases of Lua 5.1. It probably won't work with earlier versions.
Installation steps:
Install Lua with dynamic loading enabled:
For Linux, execute "make linux" and "make install". For other operating systems, read the instructions in the INSTALL file and your OS documentation on how to create shared libs.
You might want to change the occurrences of the O2 flag in CFLAGS to O3, in all makefiles, for increased speed.
Install the OSBF-Lua module:
$ tar xvzf osbf-lua-x.y.z.tar.gz
$ cd osbf-lua-x.y.z
edit the "config" file to suit your platform (not necessary for Linux) or to change the default installation PREFIX dir (/usr/local).
$ make
# make install
If you want to install in the default dir, you must be root to do the "make install" step. If you don't have root access, you may set PREFIX to point to a dir you have write access to, for instance $HOME/lib. You then need to add the new installation dir to LUA_CPATH so that the Lua loader can find osbf.so.
Ex: Installing in $HOME/lib
<edit config and set PREFIX to $HOME/lib>
$ mkdir $HOME/lib
$ make install
$ export LUA_CPATH="$HOME/lib/?.so;$LUA_CPATH"
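The search-path step can be sketched as a small shell helper. This is a minimal sketch, not part of the OSBF-Lua distribution: the function name `lua_cpath_with` is hypothetical, and the final check assumes the interpreter is installed as `lua` and that the module is loaded with `require "osbf"`. Lua separates search-path templates with semicolons, and in Lua 5.1 a `;;` in LUA_CPATH stands for the default path.

```shell
#!/bin/sh
# Build a LUA_CPATH value that searches DIR first (hypothetical helper).
# Lua separates templates with ';' and expands ';;' to its default path.
lua_cpath_with() {
    dir=$1
    current=${2-}
    if [ -n "$current" ]; then
        printf '%s/?.so;%s\n' "$dir" "$current"
    else
        printf '%s/?.so;;\n' "$dir"
    fi
}

LUA_CPATH=$(lua_cpath_with "$HOME/lib" "${LUA_CPATH-}")
export LUA_CPATH

# Optional check, skipped when no "lua" binary is on the PATH.
if command -v lua >/dev/null 2>&1; then
    if lua -e 'require "osbf"' >/dev/null 2>&1; then
        echo "osbf module loaded"
    else
        echo "osbf module not found in $LUA_CPATH"
    fi
fi
```

Quoting the value matters in the shell, because an unquoted semicolon would end the command.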
After the osbf module is properly installed, you may want to install the spamfilter, a Lua script that uses the OSBF-Lua module to classify and tag messages as spam or non-spam (ham) according to the score they get, or to the white/blacklists, if any:
make install_spamfilter
The spamfilter files are installed in /usr/local/osbf-lua. If the dir doesn't exist, it will be created.
The next step is to configure your email account to use the spamfilter:
do the following steps under your account, not as root
create your local osbf-lua dir:
mkdir $HOME/osbf-lua
create your log and cache dirs:
mkdir $HOME/osbf-lua/log
mkdir $HOME/osbf-lua/cache
Note: Old messages in the cache dir should be deleted regularly, typically from a cron job, to preserve disk space. Check Christian Siefkes' trainfilter for his clean-up script.
copy the spamfilter config file to your dir:
cp /usr/local/osbf-lua/spamfilter_config.lua $HOME/osbf-lua
edit spamfilter_config.lua to set your password
change the current dir to your osbf-lua dir and create the spamfilter databases (spam.cfc and nonspam.cfc)
cd $HOME/osbf-lua
lua /usr/local/osbf-lua/create_databases.lua # change '/usr/local' to your PREFIX
add the following lines to your .procmailrc:
# set OSBF_LUA_DIR to where spamfilter.lua, spamfilter_command.lua etc
# were installed
OSBF_LUA_DIR=/usr/local/osbf-lua # change '/usr/local' to your PREFIX
OSBF_LUA_USER_DIR=$HOME/osbf-lua

# let the Lua interpreter find the "osbf" module.
# uncomment if you installed a local copy of the osbf module (e.g. no
# root access)
#LUA_CPATH="$LUA_CPATH;$HOME/lib/?.so"

:0fw: .msgid.lock
* < 350000 # don't check messages greater than 350000 bytes
| $OSBF_LUA_DIR/spamfilter.lua --udir $OSBF_LUA_USER_DIR
Note: The "osbf-lua" dir and all files and dirs under it must be writable by the user or group that procmail runs under.
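The regular clean-up of the cache dir mentioned in the note above could look like the following. It is a minimal sketch and not the trainfilter clean-up script: the function name `clean_osbf_cache` and the 30-day retention period are assumptions, so choose a period long enough to cover the messages you may still want to train on.

```shell
#!/bin/sh
# Delete regular files under CACHE_DIR that were last modified more than
# DAYS days ago. The 30-day default is an assumption, not a value taken
# from the OSBF-Lua documentation.
clean_osbf_cache() {
    cache_dir=${1:-$HOME/osbf-lua/cache}
    days=${2:-30}
    # Nothing to do if the cache dir has not been created yet.
    [ -d "$cache_dir" ] || return 0
    find "$cache_dir" -type f -mtime +"$days" -exec rm -f {} +
}
```

Saved in a script that ends with a call such as `clean_osbf_cache "$HOME/osbf-lua/cache" 30`, it could be run daily from your own crontab rather than from root's.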
Check your installation by sending a message to yourself with the following command in the subject line:
help <your password>
You should receive a help message for the spamfilter. Then, send another command in the subject line to verify that the databases were created correctly:
stats <your password>
You should get a statistics report on the newly created databases.
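If you prefer the command line to a mail client, the two checks above might be sent as sketched here. The helper names `osbf_subject` and `osbf_send` are hypothetical, and the sketch assumes a POSIX `mailx` is installed and that mail delivered to your local account passes through the procmail recipe. Note that the password appears on the command line.

```shell
#!/bin/sh
# Build the subject line for a spamfilter command: "<command> <password>".
osbf_subject() {
    printf '%s %s\n' "$1" "$2"
}

# Mail a command to yourself with an empty body.
# The third argument is the recipient; it defaults to the current user.
osbf_send() {
    mailx -s "$(osbf_subject "$1" "$2")" "${3:-$(id -un)}" </dev/null
}

# Example (the password and address are placeholders):
#   osbf_send help  'your-password'
#   osbf_send stats 'your-password' you@example.org
```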
From now on, all messages you receive that are smaller than the maximum size specified in the procmail recipe will be classified and tagged according to the score they get:
Tag | Meaning |
[--] | almost certainly spam: score <= -20 |
[-] | probably spam (reinforcement zone): -20 < score < 0 |
[+] | probably not spam (reinforcement zone): 0 <= score < 20 |
[++] | almost certainly not spam: score >= 20. This tag is listed just for symmetry and is not used; an empty tag takes its place so as not to pollute the messages. |
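The thresholds in the table can be written down as a small function. This is only an illustration of the mapping, not the code used by spamfilter.lua; the name `osbf_tag` is hypothetical, and awk does the comparison so that fractional scores also work.

```shell
#!/bin/sh
# Map a score to the tag listed in the table above (illustrative only).
osbf_tag() {
    awk -v score="$1" 'BEGIN {
        s = score + 0
        if (s <= -20)    print "[--]"
        else if (s < 0)  print "[-]"
        else if (s < 20) print "[+]"
        else             print ""    # "[++]" is replaced by an empty tag
    }'
}

# Example:
#   osbf_tag -3.2    # falls in the spam side of the reinforcement zone
```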
If the classification is wrong, you must train the filter by replying to the message (it must be a "Reply", not a "Forward"), sending it back to yourself and replacing the subject with the corresponding training command:
learn <password> spam
or
learn <password> nonspam
The body of the message may be erased; it is not required. The original message, temporarily saved on the server by the spamfilter, will be recovered through the SFID (Spam Filter ID), a special mark added to the header when the message arrived at the server.
If you make a mistake, you should undo the training with the command "unlearn". For example, if a message was wrongly trained as spam:
unlearn <password> spam
Training when the classification is wrong is essential for accuracy. Training when the score is in the reinforcement zone, called reinforcement, is highly recommended for increasing the accuracy and keeping it high. After you have a well-trained filter, say 99% accuracy or better, you may want to reduce the reinforcement zone, e.g. to [-10, 10], so as not to do many reinforcements a day. You may change the reinforcement zone, tags, etc. by editing the file spamfilter_config.lua.
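To see what narrowing the zone means in practice, the test "is this score in the reinforcement zone?" can be sketched as follows. The function name `in_reinforcement_zone` is hypothetical, the open interval (-LIMIT, LIMIT) is an assumption read off the tag table, and the sketch does not read or change spamfilter_config.lua.

```shell
#!/bin/sh
# Succeed (exit status 0) when SCORE lies strictly between -LIMIT and LIMIT.
# LIMIT defaults to 20, the zone shown in the tag table.
in_reinforcement_zone() {
    awk -v score="$1" -v limit="${2:-20}" 'BEGIN {
        exit !(score + 0 > -limit && score + 0 < limit)
    }'
}

# Example: with the zone reduced to 10, a score of 15 no longer
# calls for reinforcement, while a score of -3 still does.
#   in_reinforcement_zone 15 10 || echo "no reinforcement needed"
#   in_reinforcement_zone -3 10 && echo "reinforce this message"
```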
The OSBF-Lua lib and spamfilter.lua were designed and implemented by Fidelis Assis, who holds the primary copyright.
The OSB technique, as well as the OSBF classify and learn code, is based on the OSB and OSBF I originally developed for the CRM114 project, as a derivative work based on Bill Yerazunis' CRM114 Markovian classifier. Bill Yerazunis holds the secondary copyright on the OSBF-Lua lib.
For more information, please email me. Comments are welcome!