"beiyesi" is a bayesian text classifier in python.
To install, run:
sudo python setup.py install
Instantiate a classifier object:
>>> import beiyesi
>>> clss = beiyesi.classifier.Classifier()
Each line is one doc, in the following format:
docid label1,label2 word1 word2 word3
- docid is a string that uniquely identifies a doc. If several docs share the same docid, only the first one is used and the rest are ignored.
- A doc can have one or more labels, separated by commas (no spaces around the commas).
- The rest of the line is a list of words separated by spaces.
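To make the line format concrete, here is a minimal sketch of a parser for it. This is not beiyesi's own code; parse_line and parse_docs are hypothetical helper names, and the duplicate handling simply follows the docid rule described above.

```python
def parse_line(line):
    """Split one 'docid label1,label2 word1 word2 ...' line into its parts."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected at least a docid and a label: %r" % line)
    docid = fields[0]
    labels = fields[1].split(",")   # labels are comma-separated, no spaces
    words = fields[2:]              # everything else is the word list
    return docid, labels, words


def parse_docs(lines):
    """Parse many lines, keeping only the first doc seen for each docid."""
    seen = set()
    docs = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        docid, labels, words = parse_line(line)
        if docid in seen:
            continue  # later docs with a repeated docid are ignored
        seen.add(docid)
        docs.append((docid, labels, words))
    return docs


docs = parse_docs([
    "1 ham aa bb cc",
    "2 spam,promo xx yy zz",
    "1 spam this duplicate docid is skipped",
])
print(docs)
```

The third line reuses docid "1", so only two docs come back.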
The classifier must first be trained with a list of docs that are already labeled:
>>> clss.trainLine("1 ham aa bb cc")
>>> clss.trainLine("2 spam xx yy zz")
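For more than a handful of docs it is easier to train in a loop. The helper below is only a sketch: train_lines is a hypothetical name, and the only thing it assumes about the classifier is the trainLine method used above.

```python
def train_lines(clss, lines):
    """Feed every non-blank line to clss.trainLine().

    `clss` is any object with a trainLine(line) method, and `lines` is any
    iterable of strings (a list, or an open file). Returns the number of
    lines that were fed to the classifier.
    """
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        clss.trainLine(line)
        count += 1
    return count
```

Since an open file is an iterable of lines, something like train_lines(clss, open("train.txt")) should work, where train.txt is a made-up file name holding one labeled doc per line.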
>>> words = ['aa','cc']
>>> print clss.classifyDoc(words)
[('ham', 0.017492711370262388), ('spam', 0.0043731778425655969)]
The classification result is a list of (label, probability) pairs, ordered from highest probability to lowest. If x is the returned list, x[0][0] is the name of the most probable label and x[0][1] is its probability.
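As a small illustration of reading the result, the snippet below works on the list printed above, pasted in as a literal so that it runs without beiyesi installed. The two values do not sum to 1 in this example, so the second half rescales them into relative weights; that rescaling is my own addition, not something classifyDoc does.

```python
# Result list copied from the classifyDoc example above.
result = [('ham', 0.017492711370262388), ('spam', 0.0043731778425655969)]

best_label, best_prob = result[0]   # the highest probability comes first
print(best_label)                   # ham

# Divide each value by the total to get weights that sum to 1.
total = sum(prob for _label, prob in result)
relative = [(label, prob / total) for label, prob in result]
print(relative)
```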
>>> clss.explain('ham', words)
(0.017492711370262388, 0.21428571428571427, [('aa', 0.2857142857142857), ('cc', 0.2857142857142857)])
It explains how the probability of the given doc "words" having label "ham" is calculated.
- The first float 0.017492711370262388 is the final result, as returned by classifyDoc.
- The second float 0.21428571428571427 is the prior probability of label "ham".
- The last list [('aa', 0.2857142857142857), ('cc', 0.2857142857142857)] shows, for each word in the doc, the probability that it contributes to the result, ordered from highest probability to lowest.
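The three parts fit together by plain multiplication: in this example the final result equals the prior times each per-word probability. The check below uses only the numbers from the explain output above, pasted in as a literal.

```python
# Tuple copied from the explain() output above.
final, prior, word_probs = (
    0.017492711370262388,
    0.21428571428571427,
    [('aa', 0.2857142857142857), ('cc', 0.2857142857142857)],
)

# Start from the prior and multiply in every per-word probability.
product = prior
for _word, prob in word_probs:
    product *= prob

print(product)
print(abs(product - final) < 1e-12)   # True
```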
I am going to skip explaining the Bayesian algorithm here. "Naive Bayes classifier in 50 lines" is a good article that I referred to.
I implemented "add-one smoothing" as described in the article above.
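For readers who have not met add-one smoothing, here is the textbook form as a rough sketch. It is not lifted from beiyesi; smoothed_word_prob is a hypothetical name, and the library's exact normalization may differ, so do not expect these values to match the explain output shown earlier.

```python
from collections import Counter


def smoothed_word_prob(word, label_words, vocabulary):
    """Textbook add-one estimate of P(word | label).

    label_words: all word occurrences seen in docs with this label.
    vocabulary:  set of every distinct word seen in training.
    """
    counts = Counter(label_words)
    return (counts[word] + 1.0) / (len(label_words) + len(vocabulary))


ham_words = ["aa", "bb", "cc"]
spam_words = ["xx", "yy", "zz"]
vocabulary = set(ham_words) | set(spam_words)

# A word seen once under the label gets (1 + 1) / (3 + 6).
print(smoothed_word_prob("aa", ham_words, vocabulary))
# A word never seen under the label still gets (0 + 1) / (3 + 6), not zero.
print(smoothed_word_prob("aa", spam_words, vocabulary))
```

The point of the extra 1 is that a single unseen word no longer forces the whole product to zero.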
I downloaded the test data set from SIAM2007. You can apply beiyesi to it by following the steps below.
- Enter the test directory:
cd beiyesi/test
- Run the following script, which converts the data set to our format and then runs the tests:
./run.sh
total=3083, wrong=2398, score=0.222186
It doesn't do well on this data set, but it is still good enough for my personal projects. Your mileage may vary depending on the data you need to process.
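The score printed by run.sh is consistent with the fraction of docs classified correctly, (total - wrong) / total. I am assuming that is how the script computes it; the arithmetic below only reuses the two counts from the output above.

```python
total = 3083
wrong = 2398

correct = total - wrong
score = float(correct) / total   # fraction of docs classified correctly

print("correct=%d, score=%f" % (correct, score))   # correct=685, score=0.222186
```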