OSBF-Lua Reference Manual
Text classification library for the Lua programming language
OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a C module for text classification written for Lua. It is a port of the OSBF classifier implemented in CRM114 (http://crm114.sf.net) and borrows many good ideas from Bill Yerazunis' CRM114, such as the databases' basic structure and the Bayesian chain implementation. The OSBF algorithm is a typical Bayesian classifier, enhanced with the OSB (Orthogonal Sparse Bigrams) feature-extraction technique and an ad hoc Confidence Factor (or "voodoo") that automatically reduces the impact of the less significant features on the classification, i.e. noise reduction. The result is a very fast and accurate classifier. It was developed with a focus on 2 classes, SPAM and NON-SPAM, so performance with more than 2 classes may not be the same.
OSBF-Lua is free software, released under the GPL version 2. This distribution includes a copy of the license in the file gpl.txt.
OSBF-Lua offers the following functions:
osbf.create_db(classes, num_buckets)
Creates the single-class databases specified in the table classes, with num_buckets buckets each.
osbf.create_db returns the number of single-class databases created, or nil plus an error message.
Ex: osbf.create_db({"nonspam.cfc", "spam.cfc"}, 94321)
osbf.remove_db(classes)
Removes all single-class databases specified in the table classes. classes is the same as in osbf.create_db.
osbf.remove_db returns true in case of success, or nil plus an error message.
Ex: osbf.remove_db({"nonspam.cfc", "spam.cfc"})
osbf.classify(text, dbset, flags, min_p_ratio)
Classifies the string text.
text: String with the text to be classified;
dbset: Lua table with the following structure:
dbset = {
    classes = {"nonspam.cfc", "spam.cfc"},
    ncfs = 1,
    delimiters = ""  -- you can put additional token delimiters here
}
ncfs: splits classes into 2 subsets. The first subset is formed by the first ncfs class databases; the remaining databases form the second subset. These 2 subsets define 2 composed classes. In the above example, each of the 2 composed classes is formed by a single class database. Another possibility, for instance, would be 2 composed classes formed by a pair of single-class databases each: global and per user. Ex:
dbset = {
    classes = {"globalnonspam.cfc", "usernonspam.cfc",
               "globalspam.cfc", "userspam.cfc"},
    ncfs = 2,  -- 2 single classes in the first subset
    delimiters = ""
}
flags: Number with the flags that control classification. Set to 0 for normal use. Each bit of the number is a flag. For now, there's only one flag defined, NO_VOODOO; that is, set flags to 1 to disable the voodoo formula. The NO_VOODOO flag is intended mainly for test purposes, because disabling the formula normally lowers accuracy.
min_p_ratio: Number with the minimum feature probability ratio. The probability ratio of a feature is the ratio between the maximum and minimum probabilities it has over the classes. Features whose ratio is less than min_p_ratio are not considered for classification. This parameter is optional; the default is 1, which means that all features are considered.
delimiters: String with extra token delimiters. Tokens are produced by the internal fixed pattern ([[:graph:]]+), in other words, by sequences of printable chars excluding tab, newline, vertical tab, form feed, carriage return, and space. If delimiters is not empty, its chars are treated as extra token delimiters, in addition to the default ones above.
osbf.classify returns 3 values, in the following order:
- pR: the log of the ratio between the probabilities of the first and second subsets;
- p_array: a Lua array with each single-class probability;
- i_pmax: index into p_array of the single class with maximum probability.
In case of error, it returns 2 values: nil and an error message.
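The optional parameters above can be combined; the following sketch (database filenames and the sample text are placeholders, and the databases are assumed to already exist) classifies with extra delimiters and a minimum feature probability ratio of 10:

```lua
require "osbf"

dbset = {
    classes = {"nonspam.cfc", "spam.cfc"},
    ncfs = 1,
    delimiters = ".,"  -- also split tokens on '.' and ','
}

text = "Cheap watches, free offers..."
-- flags = 0 (normal use); min_p_ratio = 10 drops weak features
pR, p_array, i_pmax = osbf.classify(text, dbset, 0, 10)
if pR == nil then
    print(p_array)  -- on error, the second value is the message
end
```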
osbf.learn(text, dbset, class_index, flags)
Learns the string text as belonging to the single-class database indicated by the number class_index in dbset.classes.
text: string with the text to be learned;
dbset: table with the classes. Same structure as in osbf.classify;
class_index: index to the single class, in dbset.classes, to be trained with text;
flags: Number with the flags that control the learning operation. Set to 0 for normal use. Each bit of the number is a flag. For now, there's only one flag defined, NO_MICROGROOM; that is, set flags to 1 to disable microgrooming. The NO_MICROGROOM flag is intended mainly for test purposes, because the databases have fixed size and the pruning mechanism is necessary to guarantee space for new learnings.
osbf.learn returns true in case of success or nil plus an error message in case of error.
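osbf.classify and osbf.learn are typically combined into a train-on-error loop: a message is learned only when the classifier gets it wrong. A minimal sketch, assuming the two databases above already exist and that correct_index is 1 for nonspam and 2 for spam (the helper function name is illustrative, not part of the module):

```lua
require "osbf"

dbset = {
    classes = {"nonspam.cfc", "spam.cfc"},
    ncfs = 1,
    delimiters = ""
}

-- Train-on-error sketch: learn text only if it is misclassified.
function train_on_error(text, correct_index)
    local pR, p_array = osbf.classify(text, dbset, 0)
    if pR == nil then
        return nil, p_array  -- classification error
    end
    -- pR >= 0 means the first subset (nonspam) won
    local predicted = (pR >= 0) and 1 or 2
    if predicted ~= correct_index then
        return osbf.learn(text, dbset, correct_index, 0)
    end
    return true  -- already classified correctly; nothing learned
end
```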
osbf.config(options)
Configures internal parameters. This function is intended mainly for test purposes.
options: table whose keys are the options to be set to their respective values.
The recognized options are:
max_chain: the max number of buckets allowed in a database chain. From that size on, the chain is pruned before inserting a new bucket;
stop_after: max number of buckets pruned in a chain;
K1, K2, K3: Constants used in the “voodoo” formula;
limit_token_size: limit token size to max_token_size, if different from 0. The default value is 0;
max_token_size: maximum number of chars in a token. The default is 60. This limit is observed if limit_token_size is different from 0;
max_long_tokens: long tokens, with more than max_token_size, are normally collapsed into a single hash, as if they were a single token. This is mainly to avoid database pollution with the many “tokens” found in encoded attachments. max_long_tokens defines the maximum number of consecutive long tokens that are collapsed into a single token.
Returns the number of options set.
Ex: osbf.config({max_chain = 50, stop_after = 100})
osbf.stats(dbfile)
Returns an array with information and statistics of the specified database.
dbfile: string with the database filename.
In case of error, it returns nil plus an error message.
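The individual fields of the returned table are not enumerated here, so a generic way to inspect them is to iterate with pairs (the filename is a placeholder):

```lua
require "osbf"

-- Print whatever statistics the module reports for a database.
local stats, err = osbf.stats("spam.cfc")
if stats == nil then
    print(err)  -- e.g. the database file doesn't exist
else
    for k, v in pairs(stats) do
        print(k, v)
    end
end
```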
osbf.dump(dbfile, csvfile)
Creates csvfile, a dump of dbfile in CSV format. Its main use is to transport dbfiles between different architectures (Intel vs. Sparc, for instance). A dbfile in CSV format can be restored on another architecture using the osbf.restore function below.
dbfile: string with the database filename.
csvfile: string with the csv filename.
In case of error, it returns nil
plus an error message.
osbf.restore (dbfile, csvfile)
Restores dbfile from csvfile. Be careful: if dbfile exists, it will be overwritten. Its main use is to restore a dbfile from a CSV dump produced on a different architecture.
dbfile: string with the database filename.
csvfile: string with the csv filename.
In case of error, it returns nil
plus an error message.
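A dump/restore round trip might look like the following sketch (filenames are placeholders; osbf.dump names the CSV dump function described above, and both calls are assumed to return true on success, nil plus a message on error):

```lua
require "osbf"

-- On the source machine: dump the binary database to CSV.
local ok, err = osbf.dump("spam.cfc", "spam.csv")
if not ok then print(err) end

-- On the target machine (possibly a different architecture):
-- rebuild the binary database from the CSV dump.
ok, err = osbf.restore("spam.cfc", "spam.csv")
if not ok then print(err) end
```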
osbf.import(dbfile, csvfile)
Imports the buckets in csvfile into dbfile. dbfile must exist prior to the import. Buckets originally present in dbfile are preserved as long as the microgroomer doesn't delete them to make room for the new ones. If you import into a nonempty dbfile, that is, merge dbfiles, you must guarantee that the databases being merged have learned distinct messages; otherwise the final learning counter will be inflated because of duplicates, with a bad effect on accuracy.
dbfile: string with the database filename.
csvfile: string with the csv filename.
In case of error, it returns nil
plus an error message.
osbf.chdir(dir)
Changes the current working dir to dir.
dir: string with the dirname.
In case of error, it returns nil plus an error message.
osbf.getdir()
Returns the current working dir.
In case of error, it returns nil plus an error message.
create_databases.lua:

-- Script for creating the databases
require "osbf"

-- class databases to be created
dbset = { classes = {"nonspam.cfc", "spam.cfc"} }

-- number of buckets in each database
num_buckets = 94321

-- remove previous databases with the same name
osbf.remove_db(dbset.classes)

-- create new, empty databases
osbf.create_db(dbset.classes, num_buckets)

----------------------------------------------------------------------------

classify.lua:

-- Script for classifying a message read from stdin
require "osbf"

dbset = {
    classes = {"nonspam.cfc", "spam.cfc"},
    ncfs = 1,
    delimiters = ""
}
classify_flags = 0

-- read entire message into var "text"
text = io.read("*all")

pR, p_array, i_pmax = osbf.classify(text, dbset, classify_flags)

if (pR == nil) then
    print(p_array) -- in case of error, p_array contains the error message
else
    io.write(string.format("The message score is %f - ", pR))
    if (pR >= 0) then
        io.write("HAM\n")
    else
        io.write("SPAM\n")
    end
end

See more examples of the use of the osbf module in the spamfilter dir. In particular, take a look at the toer.lua script, which is a very fast way of preparing your databases using a previously classified corpus with your ham and spam messages.