OSBF-Lua Reference Manual
Text classification library for the Lua programming language
OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a C module for text classification written for Lua. It is a port of the OSBF classifier implemented in CRM114, http://crm114.sf.net. It borrows many good ideas from Bill Yerazunis' CRM114, such as the basic database structure and the Bayesian chain implementation. The OSBF algorithm is a typical Bayesian classifier, enhanced with the OSB (Orthogonal Sparse Bigrams) feature extraction technique and an ad hoc Confidence Factor (or “voodoo”) that automatically reduces the impact of the less significant features on the classification, a form of noise reduction. The result is a very fast and accurate classifier. It was developed with 2 classes in mind, SPAM and NON-SPAM, so the performance with more than 2 classes may not be the same.
OSBF-Lua is free software and is released under the GPL version 2. You can get a copy of the license at GPL. This distribution includes a copy of the license in the file gpl.txt.
OSBF-Lua offers the following functions:
osbf.create_db(classes, num_buckets)
Creates the single class databases specified in the table classes, with num_buckets buckets each.
osbf.create_db returns the number of single class databases created or nil plus an error message.
Ex: osbf.create_db({"nonspam.cfc", "spam.cfc"}, 94321)
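The return convention above (a count on success, nil plus a message on failure) can be checked as in the following minimal sketch. The helper name create_or_fail and the choice to pass the module table as an argument are illustrative assumptions, not part of OSBF-Lua.

```lua
-- Hypothetical helper (not part of OSBF-Lua): create the databases and
-- turn the documented "nil plus error message" result into a Lua error.
-- `osbf` is the module table made available by: require "osbf"
local function create_or_fail(osbf, classes, num_buckets)
  local created, err = osbf.create_db(classes, num_buckets)
  if not created then
    error("could not create databases: " .. tostring(err))
  end
  return created -- number of single class databases created
end

-- Possible use, assuming the module is installed:
--   require "osbf"
--   create_or_fail(osbf, {"nonspam.cfc", "spam.cfc"}, 94321)
```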
osbf.remove_db(classes)
Removes all single class databases specified in the table classes.
classes is the same as in osbf.create_db.
osbf.remove_db returns true in case of success or nil plus an error message.
Ex: osbf.remove_db({"nonspam.cfc", "spam.cfc"})
osbf.classify(text, dbset, flags, min_p_ratio)
Classifies the string text.
text: String with the text to be classified;
dbset: Lua table with the following structure:
    dbset = {
      classes = {"nonspam.cfc", "spam.cfc"},
      ncfs = 1,
      delimiters = "" -- you can put additional token delimiters here
    }
ncfs: Number that splits classes into 2 subsets. The first subset is formed by the first ncfs class databases. The remaining databases form the second subset. These 2 subsets define 2 composed classes. In the above example we have 2 composed classes formed by a single class database each. Another possibility, for instance, would be 2 composed classes formed by a pair of single class databases each: global and per user. Ex:
    dbset = {
      classes = {"globalnonspam.cfc", "usernonspam.cfc",
                 "globalspam.cfc", "userspam.cfc"},
      ncfs = 2, -- 2 single classes in the first subset
      delimiters = ""
    }
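To make the role of ncfs concrete, the sketch below lists which class files fall into each composed class. It only illustrates the rule described above; split_classes is a hypothetical helper and does not call the library.

```lua
-- Illustration only: the first dbset.ncfs entries of dbset.classes form the
-- first composed class, the remaining entries form the second one.
local function split_classes(dbset)
  local first, second = {}, {}
  for i, name in ipairs(dbset.classes) do
    if i <= dbset.ncfs then
      first[#first + 1] = name
    else
      second[#second + 1] = name
    end
  end
  return first, second
end
```

With the four-database dbset shown above, the first subset holds the two nonspam files and the second subset holds the two spam files.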
flags: Number with the flags to control classification. Set to 0 for normal use. Each bit of the number is a flag. For now, there's only one flag defined, the NO_VOODOO flag. That is, set flags to 1 to disable the voodoo formula. The NO_VOODOO flag is intended mainly for test purposes, because disabling the voodoo formula normally lowers accuracy.
min_p_ratio: Number with the minimum feature probability ratio. The probability ratio of a feature is the ratio between the maximum and the minimum probabilities it has over the classes. Features with a probability ratio lower than min_p_ratio are not considered for classification. This parameter is optional. The default is 1, which means that all features are considered.
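The probability ratio can be illustrated with a small worked sketch. The functions p_ratio and is_considered are hypothetical names that follow the definition above literally. The module computes this internally in C and may treat edge cases, such as a zero probability, differently.

```lua
-- probs: array with the probability of one feature in each class.
-- Returns the ratio between the maximum and the minimum probability.
local function p_ratio(probs)
  local pmin, pmax = math.huge, 0
  for _, p in ipairs(probs) do
    if p < pmin then pmin = p end
    if p > pmax then pmax = p end
  end
  return pmax / pmin
end

-- A feature is kept when its ratio is not lower than min_p_ratio.
-- The default of 1 keeps every feature, since the ratio is never below 1.
local function is_considered(probs, min_p_ratio)
  return p_ratio(probs) >= (min_p_ratio or 1)
end
```

For example, a feature with probabilities 0.125 and 0.5 has a ratio of 4, so it is kept with min_p_ratio = 4 and ignored with min_p_ratio = 5.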
delimiters: String with extra token delimiters. The tokens are produced by the internal fixed pattern ([[:graph:]]+), or, in other words, by sequences of printable chars except tab, new line, vertical tab, form feed, carriage return, or space. If delimiters is not empty, its chars will be considered as extra token delimiters, like space, tab, new line, etc.
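The effect of extra delimiters can be approximated in plain Lua. This is only a sketch: the real tokenizer is in the C module, and the pattern below treats every non-whitespace character as a token character, which is slightly broader than [[:graph:]]. The function name tokenize is hypothetical.

```lua
-- Approximate the tokenizer: tokens are runs of characters that are neither
-- whitespace nor one of the extra delimiters.
local function tokenize(text, delimiters)
  -- Escape every non-alphanumeric delimiter so it is taken literally
  -- inside the character class.
  local extra = (delimiters or ""):gsub("%W", "%%%0")
  local pattern = "[^%s" .. extra .. "]+"
  local tokens = {}
  for token in text:gmatch(pattern) do
    tokens[#tokens + 1] = token
  end
  return tokens
end
```

With delimiters = "", the string "user@example.com" is a single token. With delimiters = "@.", it is split into "user", "example" and "com".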
osbf.classify returns 3 values, in the following order:
. pR: The log of the ratio between the probabilities of the first and second subset;
. p_array: a Lua array with each single class probability;
. i_pmax: index of the array to the single class with maximum probability.
In case of error, it returns 2 values: nil and an error message.
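The three return values and the error case can be handled as in the following sketch. The wrapper name summarize and the fields of the table it returns are illustrative assumptions. Only the call to osbf.classify and its return convention come from this manual.

```lua
-- Hypothetical wrapper around osbf.classify. `osbf` is the module table.
-- Returns a small summary table, or nil plus the error message.
local function summarize(osbf, text, dbset)
  local pR, p_array, i_pmax = osbf.classify(text, dbset, 0)
  if pR == nil then
    return nil, p_array -- on error the second value is the message
  end
  return {
    pR = pR,
    first_subset_wins = pR >= 0,       -- log ratio >= 0 favors the first subset
    best_index = i_pmax,               -- index into p_array
    best_probability = p_array[i_pmax] -- probability of that single class
  }
end
```

With the first subset holding the nonspam databases, first_subset_wins corresponds to a HAM verdict.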
osbf.learn(text, dbset, class_index, flags)
Learns the string text as belonging to the single class database indicated by the number class_index in dbset.classes.
text: string with the text to be learned;
dbset: table with the classes. Same structure as in osbf.classify;
class_index: index to the single class, in dbset.classes, to be trained with text;
flags: Number with the flags to control the learning operation. Set to 0 for normal use. Each bit of the number is a flag. For now, there's only one flag defined, the NO_MICROGROOM flag. That is, set flags to 1 to disable microgrooming. The NO_MICROGROOM flag is intended more for test purposes because the databases have fixed size and the pruning mechanism is necessary to guarantee space for new learnings.
osbf.learn returns true in case of success or nil plus an error message in case of error.
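One common way to combine osbf.classify and osbf.learn is to train only when the classifier gets a message wrong. The sketch below is one possible approach, not the method used by the scripts in this distribution. It assumes that i_pmax and class_index use the same indexing into dbset.classes. The name train_on_error is hypothetical.

```lua
-- Hypothetical helper: learn `text` as class `class_index` only if the
-- current databases would classify it differently.
-- Returns true if it learned, false if no training was needed,
-- or nil plus an error message.
local function train_on_error(osbf, text, dbset, class_index)
  local pR, p_array, i_pmax = osbf.classify(text, dbset, 0)
  if pR == nil then
    return nil, p_array -- error message from classify
  end
  if i_pmax == class_index then
    return false -- already classified as expected
  end
  local ok, err = osbf.learn(text, dbset, class_index, 0)
  if not ok then
    return nil, err
  end
  return true
end
```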
osbf.config(options)
Configures internal parameters. This function is intended more for test purposes.
options: table whose keys are the options to be set to their respective values.
The recognized options are:
max_chain: the max number of buckets allowed in a database chain. From that size on, the chain is pruned before inserting a new bucket;
stop_after: max number of buckets pruned in a chain;
K1, K2, K3: Constants used in the “voodoo” formula;
Returns the number of options set.
Ex: osbf.config({max_chain = 50, stop_after = 100})
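Because osbf.config returns the number of options set, the result can be compared with the number of options passed. The sketch below does that. It assumes that unrecognized keys are not counted, which this manual does not state explicitly. The name configure is hypothetical.

```lua
-- Hypothetical helper: apply options and report whether all of them were set.
-- Returns a boolean and the count reported by osbf.config.
local function configure(osbf, options)
  local wanted = 0
  for _ in pairs(options) do
    wanted = wanted + 1
  end
  local set = osbf.config(options)
  return set == wanted, set
end
```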
osbf.stats(dbfile)
Returns an array with information and statistics of the specified database.
dbfile: string with the database filename.
In case of error, it returns nil plus an error message.
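This manual does not list the fields of the table returned by osbf.stats, so the following sketch simply prints whatever keys it contains, sorted by name. The helper names format_stats and describe_db are hypothetical.

```lua
-- Turn any table of statistics into sorted "key = value" lines.
local function format_stats(stats)
  local rows = {}
  for k, v in pairs(stats) do
    rows[#rows + 1] = { key = tostring(k), value = tostring(v) }
  end
  table.sort(rows, function(a, b) return a.key < b.key end)
  local lines = {}
  for _, row in ipairs(rows) do
    lines[#lines + 1] = row.key .. " = " .. row.value
  end
  return table.concat(lines, "\n")
end

-- Hypothetical wrapper: returns the formatted text, or nil plus the error.
local function describe_db(osbf, dbfile)
  local stats, err = osbf.stats(dbfile)
  if not stats then
    return nil, err
  end
  return format_stats(stats)
end
```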
Creates csvfile, a dump of dbfile in CSV format.
dbfile: string with the database filename.
csvfile: string with the csv filename.
In case of error, it returns nil plus an error message.
osbf.restore(dbfile, csvfile)
Restores dbfile from csvfile.
dbfile: string with the database filename.
csvfile: string with the csv filename.
In case of error, it returns nil plus an error message.
dir: string with the dirname.
In case of error, it returns nil plus an error message.
Returns the current working dir.
In case of error, it returns nil plus an error message.
create_databases.lua:

    -- Script for creating the databases
    require "osbf"

    -- class databases to be created
    dbset = { classes = {"nonspam.cfc", "spam.cfc"} }

    -- number of buckets in each database
    num_buckets = 94321

    -- remove previous databases with the same name
    osbf.remove_db(dbset.classes)

    -- create new, empty databases
    osbf.create_db(dbset.classes, num_buckets)

----------------------------------------------------------------------------

classify.lua:

    -- Script for classifying a message read from stdin
    require "osbf"

    dbset = {
      classes = {"nonspam.cfc", "spam.cfc"},
      ncfs = 1,
      delimiters = ""
    }
    classify_flags = 0

    -- read entire message into var "text"
    text = io.read("*all")

    pR, p_array, i_pmax = osbf.classify(text, dbset, classify_flags)
    if (pR == nil) then
      print(p_array) -- in case of error, p_array contains the error message
    else
      io.write(string.format("The message score is %f - ", pR))
      if (pR >= 0) then
        io.write("HAM\n")
      else
        io.write("SPAM\n")
      end
    end

See more examples of the use of the osbf module in the spamfilter dir. In particular, take a look at the toer.lua script, which is a very fast way of preparing your databases using a previously classified corpus with your ham and spam messages.