beiyesi

"beiyesi" is a bayesian text classifier in python.

Installation

Execute:

sudo python setup.py install

Initialize a Python object

>>> import beiyesi
>>> clss = beiyesi.classifier.Classifier()

Usage

Data File Format

Each line is a doc, as shown below:

docid label1,label2 word1 word2 word3

  • docid is a string that uniquely identifies a doc. If multiple docs share the same docid, only the first one is used.
  • A doc can have one or more labels, separated by commas (no spaces around the commas).
  • The rest of the line is a list of words separated by spaces (see the parsing sketch below).
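For illustration, here is a minimal sketch of how a line in this format can be put together and split apart with plain string operations. The build_line and parse_line helpers are hypothetical, not part of beiyesi.

# Hypothetical helpers, not part of beiyesi.
def build_line(docid, labels, words):
    # docid, comma-joined labels, then the words, all separated by spaces
    return "%s %s %s" % (docid, ",".join(labels), " ".join(words))

def parse_line(line):
    parts = line.split()
    return parts[0], parts[1].split(","), parts[2:]   # docid, labels, words

build_line("1", ["ham"], ["aa", "bb", "cc"])   # -> "1 ham aa bb cc"
parse_line("2 spam,junk xx yy zz")             # -> ("2", ["spam", "junk"], ["xx", "yy", "zz"])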

Training the classifier

The classifier should be fed a list of docs that are already labeled:

>>> clss.trainLine("1 ham aa bb cc")
>>> clss.trainLine("2 spam xx yy zz")
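To train on a whole data file in the format above, each line can be fed to trainLine in turn. This loop is only a sketch; the file name train.txt is an assumption, not something shipped with the repo.

# Sketch: train from a data file, one labeled doc per line.
# "train.txt" is a hypothetical file in the format described above.
with open("train.txt") as f:
    for line in f:
        line = line.strip()
        if line:                      # skip blank lines
            clss.trainLine(line)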

Classify an unknown doc

>>> words = ['aa','cc']
>>> print clss.classifyDoc(words) 
[('ham', 0.017492711370262388), ('spam', 0.0043731778425655969)]

The classification result is a list of (label, probability) pairs, ordered from highest probability to lowest. So x[0][0] is the name of the most likely label and x[0][1] is its probability.
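For example, the most likely label is simply the first pair in the list:

result = clss.classifyDoc(words)    # [('ham', 0.0174...), ('spam', 0.0043...)]
best_label, best_prob = result[0]   # highest-probability pair comes first
# best_label == 'ham', best_prob == 0.0174...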

Explain how the probability is calculated

>>> clss.explain('ham', words)
(0.017492711370262388, 0.21428571428571427, [('aa', 0.2857142857142857), ('cc', 0.2857142857142857)])

It explains how the probability of the given doc words being the label "ham" is calculated; the example after the list below unpacks the returned tuple.

  • The first float 0.017492711370262388 is the final result as returned by classifyDoc.
  • The second float 0.21428571428571427 is the prior probability of the label "ham".
  • The last list [('aa', 0.2857142857142857), ('cc', 0.2857142857142857)] shows, for each word in the doc, the probability it contributes to the result, ordered from the highest-probability word first.
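For example, the tuple can be unpacked directly:

final_prob, prior, word_probs = clss.explain('ham', words)
# final_prob: the same value classifyDoc returned for 'ham'
# prior:      the prior probability of the label 'ham'
# word_probs: (word, probability) pairs, highest probability first
for word, prob in word_probs:
    print("%s: %f" % (word, prob))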

Math

I am going to skip explaining the Bayesian algorithm here. "Naive Bayes classifier in 50 lines" is a good article I referred to.

Smoothing

I implemented "add-one smoothing" as described in the article above.
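As a rough illustration (not beiyesi's actual code), add-one smoothing adds one to every word count so that words unseen during training never get a zero probability:

# Illustrative add-one (Laplace) smoothing, not beiyesi's internals.
# word_count:  occurrences of the word in docs with this label
# total_words: total word occurrences under this label
# vocab_size:  number of distinct words seen in training
def smoothed_prob(word_count, total_words, vocab_size):
    return (word_count + 1.0) / (total_words + vocab_size)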

Examples

I downloaded the test data set from SIAM2007. You can apply beiyesi to it by following the steps below.

  1. Enter the test directory

    cd beiyesi/test

  2. The following script converts the data set to our format and then runs the tests.

    ./run.sh

    total=3083, wrong=2398, score=0.222186

It doesn't do well on this data set, but it's still OK for my personal projects, so your mileage may vary depending on the data you need to process.