OSBF-Lua
Text classification library for the Lua Programming Language
OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a C module for text classification written for Lua. It is a port of the OSBF classifier implemented in CRM114, http://crm114.sf.net, and borrows many good ideas from Bill Yerazunis' CRM114, such as the basic database structure and the Bayesian chain implementation. The OSBF algorithm is a typical Bayesian classifier enhanced with the OSB (Orthogonal Sparse Bigrams) feature extraction technique and an ad hoc Confidence Factor (aka “voodoo”) that automatically reduces the impact of the less significant features on the classification, a form of noise reduction. The result is a very fast and accurate classifier. It was developed with 2 classes in mind, SPAM and NON-SPAM, so performance with more than 2 classes may not be the same.
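As a rough illustration of the OSB feature extraction mentioned above, the sketch below pairs each token with each of the tokens that follow it inside a sliding window and marks the skipped positions. The window size of 5, the `<skip>` marker and the function name `osb_features` are assumptions made for this example; they are not taken from the OSBF-Lua source, and the Confidence Factor weighting is not shown.

```shell
# Print one orthogonal sparse bigram per line for the tokens given as arguments.
# Assumption: a 5-token window, so each token is paired with up to 4 successors.
osb_features() {
  window=5
  while [ "$#" -gt 1 ]; do
    first=$1
    shift
    distance=1
    skips=""
    for token in "$@"; do
      if [ "$distance" -ge "$window" ]; then
        break
      fi
      # The number of <skip> markers encodes the gap between the two tokens.
      printf '%s %s%s\n' "$first" "$skips" "$token"
      skips="${skips}<skip> "
      distance=$((distance + 1))
    done
  done
}

osb_features buy cheap pills now
```

For the four tokens in the call above this prints six features, from "buy cheap" up to "buy <skip> <skip> now". In a classifier of this kind each such feature would typically be hashed and counted per class.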
I originally developed both OSB and OSBF for the CRM114 project, but the OSB technique was officially announced in the paper “Combining Winnow with Orthogonal Sparse Bigrams for Incremental Spam Filtering”, a work headed and presented by Christian Siefkes at the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), in September 2004.
This implementation tries to keep the focus on the classification task itself by using Lua, a powerful yet lightweight and fast language, which makes it easier to build and test more elaborate filters and training methods.
OSBF-Lua is free software released under the GPL version 2. A copy of the license is included in this distribution in the file gpl.txt.
The sources can be downloaded from http://luaforge.net/projects/osbf-lua
Alessandro Martins maintains a package for Slackware at http://www.martins.eng.br/slackware/osbf-lua
[08/Jan/2006] Version 1.5.3b
Fixes to the osbf module
Fixed the database restore function (osbf.restore);
Changed the osbf.so link from absolute to relative to make it simpler to generate the Slackware package – suggested by Alessandro Martins <alessandro@martins.eng.br>.
Improvements and fixes to spamfilter (v1.1.3):
Better detection of the “Subject:” header line;
Improved scan for a command in the subject line. Now it'll detect a command even if another filter in the middle mistakenly adds a tag to the beginning of the subject line. Problem pointed out by Pavel Kolar.
[01/Jan/2006] Version 1.5.2b
Improvements and fixes to spamfilter:
The recover command now sends the recovered message as an attachment;
Added a new config option, osbf.cfg_remove_body_threshold, to remove the body of spam messages. Setting osbf.cfg_remove_body_threshold = 20 in spamfilter_config.lua removes the body of all spam messages with score greater than 20. The original message is still available with the recover command, if needed;
Fixed a problem that occurred when a command-message was sent in HTML format. Because of the Content-Type header in the original message, the answer, in plain text format, was not visible;
Fixed a bug in the password parsing. An invalid password was accepted as OK if it started with the valid password as a substring and was the last string in the command.
Improvements to the lib
New function added, osbf.config, to allow internal parameter adjustments. This function is intended mainly for experiments and debugging.
[15/Nov/2005] Version 1.5.1b
Improvements and fixes to spamfilter, toer.lua and docs:
All X-OSBF headers were merged into a single one as suggested by Pavel Kolar <kolar@fzu.cz>:
Ex: X-OSBF-Lua-Score: 33.63/0.00 [H] (v1.5.1b, Spamfilter v1.1);
White and blacklisted messages are now classified too, so that the score in the header X-OSBF-Lua-Score is the real one, as if they hadn't been listed – suggested by Pavel Kolar. The subject tags for blacklisted and whitelisted messages are the same as configured for spam and ham in the config file, respectively;
The tags in the X-OSBF-Lua-Score header no longer follow the subject tags defined in the config file. They're now fixed: [B], [S], [s], [h], [H], [W] for blacklisted, spam, spam reinforcement, ham reinforcement, ham and whitelisted, according to the classification;
White and black lists don't use Lua regex by default any more. There's a new option in the config file to turn regex on or off: osbf.cfg_lists_use_regex;
Removed the trailing spaces from the subject tags in the config file. They're now added internally;
Removed duplicate database usage info shown by the “stats <pwd>” command;
The variable unlearn_threshold in spamfilter_commands.lua is now an option in the config file, as it should be: osbf.cfg_unlearn_threshold;
More consistent threshold checking in toer.lua;
DSTTT is now the default training method in toer.lua.
Added the script roc.lua, which calculates 1-ROCAC%, a measure of the quality of the classifier.
[06/Nov/2005] Version 1.5b - first public release
Re-tuning of internal parameters, after the chain rule fix, resulting in improved accuracy.
Docs and example scripts updated.
[30/Sep/2005] Version 1.4b - internal use only
Changed the data structure for seen_features and other flags to a separate array of unsigned chars, in the learn function.
[25/Sep/2005] Version 1.3b - internal use only
C and Lua code updated for lua-5.1-alpha
No more captures in string.find
Use of the new string.match
[08/Sep/2005] Version 1.2b - internal use only
Fixed an old bug in the chain rule that caused bad accuracy with some corpora. It would sometimes also cause unexpectedly worse scores after training, as if one had done an “unlearn”;
Fixed a bug in the “unlearn” code that caused broken chains in the databases;
Implemented a new training method that acts on both the spam and ham databases simultaneously, doing a “learn” on the right database and an “unlearn” on the opposite one if the score improvement was not enough. Both toer.lua and spamfilter.lua now use this new method.
[25/Aug/2005] Version 1.1b - internal use only
Changed the training method used by the spamfilter. The original message is now saved on the server under a unique SpamFilter ID (SFID), and the message is delivered to the user with the SFID added as a comment to its “Message-ID” header. When the user does a “Reply” for training, the original message is recovered using the SFID that the user's mail client sends back in the “In-Reply-To” or “References” header.
[13/May/2005] Version 1.0b18 - internal use only
[16/Mar/2005] Version 1.0b12 - internal use only
[28/Jan/2005] Version 1.0b1 – internal use only
OSBF-Lua follows the package proposal for Lua 5.1 and requires Lua 5.1 installed with dynamic loading enabled. OSBF-Lua was developed and tested under the Lua 5.1 work, alpha and beta versions. It won't work with previous versions.
Installation steps:
Install Lua with dynamic loading enabled:
For Linux, execute “make linux” and “make install”. For other operating systems, read the instructions in the INSTALL file and your OS documentation on how to create shared libs.
You might want to change the occurrences of the -O2 flag in CFLAGS to -O3, in all makefiles, for increased speed.
Install the OSBF-Lua module:
tar xvzf osbf-lua-x.y.z.tar.gz
cd osbf-lua-x.y.z
<edit the “config” file to suit your platform – not necessary for Linux>
make
make install
You must be root to do the “make install” step. If you don't have root access, copy the newly created libosbf.so-x.y.z to a directory you have access to, under the name “osbf.so”, and add that directory to LUA_PATH. Ex:
mkdir $HOME/lib
cp libosbf.so-x.y.z $HOME/lib/osbf.so
LUA_PATH=$LUA_PATH:$HOME/lib
After the osbf module is properly installed, you may want to install the spamfilter, a Lua script that uses the OSBF-Lua module to classify and tag messages as spam or non-spam (ham) according to the score they get, or to the white/blacklists, if any:
make install_spamfilter
The spamfilter files are installed in /usr/local/osbf-lua. If the directory doesn't exist, it will be created.
The next step is to configure your email account to use the spamfilter:
do the following steps under your account, not as root
create your local osbf-lua dir:
mkdir $HOME/osbf-lua
create your log dir:
mkdir $HOME/osbf-lua/log
copy the spamfilter config file to your dir:
cp /usr/local/osbf-lua/spamfilter_config.lua $HOME/osbf-lua
edit spamfilter_config.lua to set your password
change the current dir to your osbf-lua dir and create the spamfilter databases
cd $HOME/osbf-lua
lua /usr/local/osbf-lua/create_databases.lua
add the following lines to your .procmailrc
# set OSBF_LUA_DIR to where spamfilter.lua, spamfilter_commands.lua etc were installed
OSBF_LUA_DIR=/usr/local/osbf-lua
OSBF_LUA_USER_DIR=$HOME/osbf-lua
# let the Lua interpreter find the “osbf” module.
# uncomment if you installed a local copy of the osbf module (e.g. no root access)
#LUA_PATH=$LUA_PATH:$HOME/lib
:0fw: .msgid.lock
* < 350000 # don't check messages greater than 350000 bytes
| $OSBF_LUA_DIR/spamfilter.lua --gdir $OSBF_LUA_DIR --udir $OSBF_LUA_USER_DIR
Check your installation by sending a message to yourself with the following command in the subject line:
help <your password>
You should receive a help message for the spamfilter. Then send another command in the subject line to verify that the databases were created correctly:
stats <your password>
You should get a statistics report on the newly created databases.
From now on, all messages you receive that are smaller than the maximum size specified in the procmail recipe will be classified and tagged according to the score they get:
Tag | Meaning
[--] | almost certainly spam: score <= -20
[-] | probably spam (reinforcement zone): score < 0 and > -20
[+] | probably not spam (reinforcement zone): score >= 0 and < 20
[++] | almost certainly not spam: score >= 20. This tag is listed just for symmetry and is not used; an empty tag is used in its place so as not to clutter the messages.
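The thresholds in the table can be read as a simple mapping from score to tag. The following is only a sketch of that mapping, not the spamfilter's own code: the function name `tag_for_score` is hypothetical, and the limits of -20 and 20 are the defaults shown in the table, which can be changed in the config file.

```shell
# Print the subject tag for a given score, following the table above.
# awk is used because scores are fractional (e.g. 33.63).
tag_for_score() {
  awk -v s="$1" 'BEGIN {
    if (s <= -20)    print "[--]"
    else if (s < 0)  print "[-]"
    else if (s < 20) print "[+]"
    else             print ""    # [++] is never written; an empty tag is used
  }'
}

tag_for_score -3.5
```

The call above prints [-], meaning the message is probably spam and falls inside the reinforcement zone.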
If the classification is wrong, you must train the filter by replying to the message (you must do a “Reply”, not a “Forward”), sending it back to yourself and replacing the subject with the corresponding training command:
learn <password> spam
or
learn <password> nonspam
The body of the message may be erased; it's not required. The original message, temporarily saved on the server by the spamfilter, will be recovered through the SFID (Spam Filter ID), a special mark added to the header when the message arrived at the server.
If you make a mistake, undo the training with the “unlearn” command. For example, if a message was wrongly trained as spam:
unlearn <password> spam
Training when the classification is wrong is essential for accuracy. Training on correctly classified messages that fall in the reinforcement zone, called reinforcement, is highly recommended to increase accuracy and keep it high. Once you have a well trained filter, say with 99% or better accuracy, you may want to reduce the reinforcement zone, e.g. to [-10, 10], so that you don't have to do many reinforcements a day. You can change the reinforcement zone, tags, etc., by editing the spamfilter_config.lua file.
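To see how narrowing the reinforcement zone reduces the number of messages that call for reinforcement, here is a small sketch of the zone check. The helper name `in_reinforcement_zone` is hypothetical, and the zone is assumed to be symmetric around zero with the limits excluded, as in the tag table; the real check lives in the spamfilter script and is driven by spamfilter_config.lua.

```shell
# Succeed (exit status 0) when the score lies inside the reinforcement zone.
# Arguments: score, limit. Assumption: the zone is -limit < score < limit.
in_reinforcement_zone() {
  awk -v s="$1" -v limit="$2" 'BEGIN { exit !(s > -limit && s < limit) }'
}

# A score of 12.4 needs reinforcement with the default limit of 20 ...
if in_reinforcement_zone 12.4 20; then echo "reinforce"; else echo "skip"; fi
# ... but not once the zone is reduced to a limit of 10.
if in_reinforcement_zone 12.4 10; then echo "reinforce"; else echo "skip"; fi
```

The first check prints "reinforce" and the second prints "skip". Misclassified messages must still be trained regardless of where their score falls.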
The OSBF-Lua lib and spamfilter.lua were designed and implemented by Fidelis Assis, who holds the copyright.
The OSB technique, as well as the OSBF classify and learn code, are based on the OSB and OSBF I originally developed for the CRM114 project (http://crm114.sf.net), as a derivative work based on Bill Yerazunis' CRM114 Markovian classifier.
For more information please email me. Comments are welcome!