OSBF-Lua Reference Manual

Text classification library for the Lua programming language



Introduction

OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a C module for text classification written for Lua. It is a port of the OSBF classifier implemented in CRM114, http://crm114.sf.net, and it borrows many good ideas from Bill Yerazunis' CRM114, such as the basic database structure and the Bayesian chain implementation. The OSBF algorithm is a typical Bayesian classifier enhanced with the OSB (Orthogonal Sparse Bigrams) feature extraction technique and an ad hoc Confidence Factor (or “voodoo”) that automatically reduces the impact of the less significant features on the classification, which acts as noise reduction. The result is a very fast and accurate classifier. It was developed with a focus on two classes, SPAM and NON-SPAM, so the performance with more than two classes may not be the same.
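To give a feel for the OSB feature extraction mentioned above, here is a minimal sketch in plain Lua. It pairs each token with the tokens that follow it inside a sliding window and marks the skipped positions. The function name osb_features, the whitespace tokenization, the "<skip>" marker and the window of 5 tokens are assumptions made for illustration; the C module does its own tokenization and hashing, which this sketch does not reproduce.

```lua
-- Sketch of Orthogonal Sparse Bigram (OSB) feature extraction.
-- Assumptions (not taken from the module): whitespace tokens, window of 5,
-- and a literal "<skip>" marker for each skipped position.
local function osb_features(text, window)
  window = window or 5

  -- split the text into whitespace-delimited tokens
  local tokens = {}
  for tok in text:gmatch("%S+") do
    tokens[#tokens + 1] = tok
  end

  local features = {}
  for i = 1, #tokens do
    -- pair token i with each later token that still fits in the window
    for dist = 1, window - 1 do
      local j = i + dist
      if j > #tokens then break end
      -- one "<skip>" per token skipped between the two members of the pair
      local gap = string.rep("<skip> ", dist - 1)
      features[#features + 1] = tokens[i] .. " " .. gap .. tokens[j]
    end
  end
  return features
end

-- usage: list the sparse bigrams of a short phrase
for _, feature in ipairs(osb_features("buy cheap pills now")) do
  print(feature)
end
```

Each pair keeps the distance between its two words, so "buy <skip> pills" and "buy pills" count as different features.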

OSBF-Lua is free software and is released under the GPL version 2. You can get a copy of the license at GPL. This distribution includes a copy of the license in the file gpl.txt.

Reference

OSBF-Lua offers the following functions:

text: String with the text to be classified;


dbset: Lua table with the following structure:















Examples

create_databases.lua:

-- Script for creating the databases

require "osbf"

-- class databases to be created
dbset = { classes = {"nonspam.cfc", "spam.cfc"} }

-- number of buckets in each database
num_buckets = 94321

-- remove previous databases with the same name
osbf.remove_db(dbset.classes)

-- create new, empty databases
osbf.create_db(dbset.classes, num_buckets)

----------------------------------------------------------------------------


classify.lua:

-- Script for classifying a message read from stdin

require "osbf"

dbset = {
          classes = {"nonspam.cfc", "spam.cfc"},
          ncfs = 1,
          delimiters = ""
}
classify_flags = 0

-- read entire message into var "text"
text = io.read("*all")

pR, p_array, i_pmax = osbf.classify(text, dbset, classify_flags)

if (pR == nil) then
   print(p_array)  -- in case of error, p_array contains the error message
else
   io.write(string.format("The message score is %f - ", pR))
   if (pR >= 0) then
     io.write("HAM\n")
   else
     io.write("SPAM\n")
   end
end
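
The decision step at the end of classify.lua can be factored into a small helper. The sketch below is plain Lua and does not call the osbf module. The name describe_result and the optional margin parameter (an "unsure" band around zero) are hypothetical, and reading i_pmax as a 1-based index into dbset.classes is an assumption based on the variable name.

```lua
-- Hypothetical helper; not part of the osbf module.
-- Assumes i_pmax is a 1-based index into dbset.classes (not confirmed here).
local function describe_result(pR, i_pmax, classes, margin)
  margin = margin or 0   -- 0 reproduces the HAM/SPAM split used in classify.lua
  if pR == nil then
    return nil, "classification failed"
  end
  local verdict
  if pR >= margin then
    verdict = "HAM"
  elseif pR <= -margin then
    verdict = "SPAM"
  else
    verdict = "UNSURE"   -- score too close to zero for the chosen margin
  end
  return verdict, classes[i_pmax]
end

-- usage with made-up scores
local classes = {"nonspam.cfc", "spam.cfc"}
print(describe_result(12.5, 1, classes))      -- default margin of 0
print(describe_result(-3.0, 2, classes, 10))  -- falls inside the unsure band
```

With the default margin the helper behaves like the script above: scores of zero or more are reported as HAM and negative scores as SPAM.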


See the spamfilter directory for more examples of how to use the osbf module.
In particular, take a look at the toer.lua script, which is a very fast
way of preparing your databases from a previously classified corpus
of your ham and spam messages.
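
One common way to prepare databases from a labelled corpus is to train only on the messages the classifier gets wrong or scores weakly. The sketch below illustrates that general idea in plain Lua. It is not the code of toer.lua, and reading that script's name as this kind of scheme is an assumption. The classify and learn arguments are hypothetical callbacks supplied by the caller, not osbf functions, and the corpus layout (text and is_ham fields) and the margin parameter are invented for the example.

```lua
-- Sketch of training on errors, with reinforcement of weak scores.
-- `classify` and `learn` are hypothetical callbacks, NOT osbf functions.
local function train_on_error(corpus, classify, learn, margin)
  margin = margin or 0
  local trained = 0
  for _, msg in ipairs(corpus) do
    local pR = classify(msg.text)
    local predicted_ham = pR >= 0            -- same rule as classify.lua
    local wrong = predicted_ham ~= msg.is_ham
    local weak = math.abs(pR) < margin       -- right, but not by much
    if wrong or weak then
      learn(msg.text, msg.is_ham)
      trained = trained + 1
    end
  end
  return trained
end

-- toy stand-ins so the sketch runs without the osbf module
local scores = { ["hello friend"] = 12, ["win money"] = 4, ["cheap pills"] = -9 }
local learned = {}
local count = train_on_error(
  { { text = "hello friend", is_ham = true  },
    { text = "win money",    is_ham = false },
    { text = "cheap pills",  is_ham = false } },
  function(text) return scores[text] end,
  function(text, is_ham) learned[#learned + 1] = text end,
  5)
print(count, learned[1])
```

Skipping the messages that are already classified correctly keeps the number of training operations small, which is what makes this kind of loop fast on a large corpus.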

----------------------------------------------------------------------------
