OSBF-Lua Reference Manual
Text classification module for the Lua programming language
OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a Lua C module for text classification. It is a port of the OSBF classifier implemented in the CRM114 project. This implementation focuses on the classification task itself by using Lua as the scripting language, a powerful yet light-weight and fast language, which makes it easier to build and test more elaborate filters and training methods.
OSBF-Lua
is free software and is released under the GPL version 2. You can get
a copy of the license at GPL.
This distribution includes a copy of the license in the file
gpl.txt.
OSBF-Lua offers the following functions:
osbf.create_db (classes, num_buckets)
Creates the single class databases specified in the table classes, with num_buckets buckets each.
osbf.create_db returns the number of single class databases created, or nil plus an error message.
Ex: osbf.create_db({"nonspam.cfc", "spam.cfc"}, 94321)
osbf.remove_db (classes)
Removes all single class databases specified in the table classes. classes is the same as in osbf.create_db.
osbf.remove_db returns true in case of success or nil plus an error message.
Ex: osbf.remove_db({"nonspam.cfc", "spam.cfc"})
osbf.classify (text, dbset, flags, min_p_ratio)
Classifies the string text.
text: string with the text to be classified;
dbset: Lua table with the following structure:
dbset = {
    classes = {"nonspam.cfc", "spam.cfc"},
    ncfs = 1,
    delimiters = "" -- you can put additional token delimiters here
}
classes: array with the filenames of the single class databases used for classification.
ncfs: number of class databases in the first subset. The classes are split into 2 subsets: the first is formed by the first ncfs class databases, and the remaining databases form the second. These 2 subsets define 2 composed classes. In the example above, each composed class is formed by a single class database. Another possibility, for instance, would be 2 composed classes formed by a pair of single class databases each: global and per user. Ex:
dbset = {
    classes = {"globalnonspam.cfc", "usernonspam.cfc", "globalspam.cfc", "userspam.cfc"},
    ncfs = 2, -- 2 single classes in the first subset
    delimiters = ""
}
flags:
Number with the classification control flags. Each bit is a flag. The
available flags are:
min_p_ratio:
Number with the minimum feature probability ratio. The probability
ratio of a feature is the ratio between the maximum and the minimum
probabilities it has over the classes. Features whose ratio is less than
min_p_ratio are not considered for classification. This parameter is
optional; the default is 1, which means that all features are
considered.
delimiters: String with extra token delimiters.
The tokens are produced by the internal fixed pattern ([[:graph:]]+),
or, in other words, by sequences of printable chars except tab, new
line, vertical tab, form feed, carriage return, or space. If delimiters
is not empty, its chars will be considered as extra token delimiters,
like space, tab, new line, etc.
osbf.classify returns 3 values in case of success:
pR: the log of the ratio between the probabilities of the first and second subsets;
p_array: a Lua array with the probability of each single class;
i_pmax: the index in p_array of the single class with maximum probability.
In case of error, it returns nil plus an error message.
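Putting the return values together, a minimal classification call might look like the sketch below (it assumes the two databases created earlier exist in the current directory):

```lua
require "osbf"

dbset = {
    classes = {"nonspam.cfc", "spam.cfc"},
    ncfs = 1,
    delimiters = ""
}

local pR, p_array, i_pmax = osbf.classify("some message text", dbset, 0)
if pR == nil then
    -- on error, the second return value is the error message
    print("error: " .. p_array)
else
    -- pR >= 0 means the first subset (nonspam here) is more likely
    print(string.format("pR = %.4f, most likely class: %s (p = %f)",
                        pR, dbset.classes[i_pmax], p_array[i_pmax]))
end
```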
osbf.learn (text, dbset, class_index, flags)
Learns the string text as belonging to the single class database indicated by the number class_index in dbset.classes.
text: string with the text to be learned;
dbset: table with the classes. Same structure as in osbf.classify;
flags: Number with the flags to control the learning operation. Each bit is a flag. The available flags are:
NO_MICROGROOM = 1 - disable microgrooming;
MISTAKE = 2 - increment the mistake counter, besides the learning counter;
EXTRA_LEARNING = 4 - increment the extra-learning, or reinforcement, counter, besides the learning counter;
The NO_MICROGROOM flag is intended mainly for tests, because the databases have fixed size and the pruning mechanism is necessary to guarantee space for new learnings. The MISTAKE and EXTRA_LEARNING flags shouldn't be used simultaneously.
osbf.learn returns true in case of success or nil plus an error message in case of error.
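As a sketch of how osbf.classify and osbf.learn combine in a train-on-error loop (the helper name train_on_error and the threshold value 20 are illustrative, not part of the API):

```lua
require "osbf"

dbset = { classes = {"nonspam.cfc", "spam.cfc"}, ncfs = 1, delimiters = "" }

-- learn text into class class_index only when the classifier is wrong
-- or not confident enough (|pR| below the given threshold)
function train_on_error(text, class_index, threshold)
    local pR, err = osbf.classify(text, dbset, 0)
    if pR == nil then return nil, err end
    local in_first_subset = (class_index <= dbset.ncfs)
    local correct = (pR >= 0) == in_first_subset
    if not correct or math.abs(pR) < threshold then
        return osbf.learn(text, dbset, class_index, 0)
    end
    return true -- already classified correctly with enough confidence
end

train_on_error("buy cheap meds now", 2, 20)
```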
osbf.config (options)
Configures internal parameters. This function is intended mainly for test purposes.
options: table whose keys are the options to be set to their respective values. The available options are:
max_chain: the max number of buckets allowed in a database chain. From that size on, the chain is pruned before inserting a new bucket;
stop_after: max number of buckets pruned in a chain;
K1, K2, K3: Constants used in the EDDC formula;
limit_token_size: limit token size to max_token_size, if not equal to 0. The default value is 0;
max_token_size: maximum number of chars in a token. The default is 60. This limit is observed if limit_token_size is different from 0;
max_long_tokens: sequences with more than max_long_tokens tokens where the tokens are greater than max_token_size are collapsed into a single hash, as if they were a single token. This is to reduce database pollution with the many "tokens" found in encoded attachments.
osbf.config returns the number of options set.
Ex: osbf.config({max_chain = 50, stop_after = 100})
osbf.stats (dbfile [, full])
Returns a table with information and statistics of the specified database. The keys of the table are:
version - version of the module;
buckets - total number of buckets in the database;
bucket_size - size of a bucket, in bytes;
header_size - size of the header, in buckets;
learnings - number of learnings;
mistakes - number of learnings done as mistake corrections (see the MISTAKE flag of osbf.learn);
extra_learnings - number of extra learnings done internally when a single learning is not enough;
classifications - number of classifications;
chains - number of bucket chains;
max_chain - length of the longest chain;
avg_chain - average chain length;
max_displacement - max distance a bucket is from its "right" place;
used_buckets - number of used buckets;
use - percentage of used buckets.
dbfile: string with the database filename.
full: optional boolean argument. If present and equal to false, only the values already in the header of the database are returned, that is, the values for the keys version, buckets, bucket_size, header_size, learnings, extra_learnings, classifications and mistakes. If full is equal to true, or not given, the complete statistics are returned. For large databases, osbf.stats is much faster when full is equal to false.
In case of error, it returns nil plus an error message.
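For example, to print a few counters of a database; since these values live in the header, full can be false for speed:

```lua
require "osbf"

local stats, err = osbf.stats("spam.cfc", false) -- header values only
if not stats then
    print("error: " .. err)
else
    print("learnings:       " .. stats.learnings)
    print("classifications: " .. stats.classifications)
    print("buckets:         " .. stats.buckets)
end
```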
Creates csvfile, a dump of dbfile in CSV format. Its main use is to transport dbfiles between different architectures (Intel vs. Sparc, for instance). A dbfile in CSV format can be restored on another architecture using the osbf.restore function below.
dbfile: string with the database filename.
csvfile: string with the csv filename.
In case of error, it returns nil plus an error message.
osbf.restore (dbfile, csvfile)
Restores dbfile from csvfile. Be careful: if dbfile exists, it will be overwritten. Its main use is to restore a dbfile from a CSV dump produced on a different architecture.
dbfile: string with the database filename.
csvfile: string with the csv filename.
In case of error, it returns nil plus an error message.
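For example, restoring a database from a CSV dump produced on another machine (filenames are illustrative; error handling follows the same convention as the other functions):

```lua
require "osbf"

-- WARNING: if spam.cfc already exists it will be overwritten
local ok, err = osbf.restore("spam.cfc", "spam.csv")
if not ok then
    print("restore failed: " .. err)
end
```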
osbf.import (to_dbfile, from_dbfile)
Imports the buckets in from_dbfile into to_dbfile. from_dbfile must exist. Buckets originally present in to_dbfile will be preserved as long as the microgroomer doesn't delete them to make room for the new ones. The counters (learnings, classifications, mistakes, etc.) in the destination database will be incremented by the respective values in the origin database. The main purpose of this function is to expand or shrink a database, by importing into a larger or smaller empty one.
to_dbfile: string with the destination database filename.
from_dbfile: string with the origin database filename.
In case of error, it returns nil plus an error message.
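For example, to shrink spam.cfc into a smaller database (the new filename and bucket count here are illustrative):

```lua
require "osbf"

-- create a new, smaller, empty database and import the old buckets into it
osbf.create_db({"spam_small.cfc"}, 47051)
local ok, err = osbf.import("spam_small.cfc", "spam.cfc")
if not ok then
    print("import failed: " .. err)
end
```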
dir: string with the directory name.
In case of error, it returns nil plus an error message.
Returns the current working dir. In case of error, it returns nil plus an error message.
------------------------------------------------------------------
-- create_databases.lua: Script for creating the databases
require "osbf"
-- class databases to be created
dbset = { classes = {"nonspam.cfc", "spam.cfc"} }
-- number of buckets in each database
num_buckets = 94321
-- remove previous databases with the same name
osbf.remove_db(dbset.classes)
-- create new, empty databases
osbf.create_db(dbset.classes, num_buckets)
------------------------------------------------------------------
-- classify.lua: Script for classifying a message read from stdin
require "osbf"
dbset = {
    classes = {"nonspam.cfc", "spam.cfc"},
    ncfs = 1,
    delimiters = ""
}
classify_flags = 0
-- read entire message into var "text"
text = io.read("*all")
pR, p_array, i_pmax = osbf.classify(text, dbset, classify_flags)
if (pR == nil) then
    print(p_array) -- in case of error, p_array contains the error message
else
    io.write(string.format("The message score is %f - ", pR))
    if (pR >= 0) then
        io.write("HAM\n")
    else
        io.write("SPAM\n")
    end
end
------------------------------------------------------------------
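Following the same pattern, a minimal training script (not part of this distribution; the command-line handling is illustrative) could read a message from stdin and learn it into a class given as an argument:

```lua
------------------------------------------------------------------
-- train.lua: illustrative script for training on a message read from stdin
-- usage: lua train.lua <class_index>   (1 = nonspam, 2 = spam)
require "osbf"
dbset = {
    classes = {"nonspam.cfc", "spam.cfc"},
    ncfs = 1,
    delimiters = ""
}
class_index = tonumber(arg[1]) or 2
text = io.read("*all")
ok, err = osbf.learn(text, dbset, class_index, 0)
if not ok then
    print("error: " .. err)
end
------------------------------------------------------------------
```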