Naive Bayes Document Classification
The data used is the IMDB Movie Review dataset. The code was written in Python and builds two dictionaries, one per class (positive, negative), each mapping the words found in that class's reviews to their conditional probabilities. Since the training samples are split 50-50 between the two classes, we set the prior P(Cj) to 0.5 by default.
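As a minimal sketch of this dictionary-building step (the function and variable names below are illustrative, not the exact ones from our code):

```python
from collections import Counter

def build_dictionary(tokenized_reviews):
    """Count word frequencies over all reviews of one class and convert
    them to conditional probabilities P(w|C)."""
    counts = Counter()
    for tokens in tokenized_reviews:
        counts.update(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# One dictionary per class; the priors are fixed at 0.5 because the
# training set is balanced.
# pos_dict = build_dictionary(positive_review_tokens)
# neg_dict = build_dictionary(negative_review_tokens)
P_POSITIVE = 0.5
P_NEGATIVE = 0.5
```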
We used the NLTK package to split sentences into words and count the frequency of each word; the functions and lists used are described in Table 1-1:
Package / Parameter | Description |
---|---|
nltk.pos_tag | Append the part-of-speech (word type) to each word |
wordnet.lemmatize | Change verbs to present tense; change adjectives and adverbs to their base form |
wordnet.morphy | Change plural words to singular |
stopwords | Filter out all stop words |
word_type_list_In | Word types (POS tags) to include |
word_type_list_Ex | Word types to exclude (higher priority than the included list), such as some stop-word types |
word_type_Ex | Specific words to exclude, e.g. useless tokens such as "<", "br", "movie", "film" |
ngrams | Combine nearby words into n-grams; initial n = 2 |
multiprocessing, partial | Acceleration across multiple processors |
Table 1-1. The packages used and their brief description.
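The sketch below shows, under stated assumptions, how these pieces fit together in the preprocessing step; the specific included and excluded POS-tag lists are placeholders, not the exact lists we used:

```python
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams

STOPWORDS = set(stopwords.words('english'))
WORD_EX = {'<', 'br', 'movie', 'film'}                 # useless tokens (word_type_Ex)
WORD_TYPE_LIST_IN = {'NN', 'JJ', 'RB', 'VB'}           # included POS tags (illustrative)
WORD_TYPE_LIST_EX = {'DT', 'IN', 'CC'}                 # excluded POS tags, higher priority (illustrative)
LEMMATIZER = WordNetLemmatizer()

def to_wordnet_pos(tag):
    # Map a Penn Treebank tag to the WordNet POS expected by the lemmatizer.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess(text, n=2):
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)                      # append word type
    kept = []
    for word, tag in tagged:
        if word in STOPWORDS or word in WORD_EX:
            continue
        if tag in WORD_TYPE_LIST_EX or not any(tag.startswith(t) for t in WORD_TYPE_LIST_IN):
            continue
        lemma = LEMMATIZER.lemmatize(word, to_wordnet_pos(tag))  # base form
        lemma = wordnet.morphy(lemma) or lemma         # plural -> singular fallback
        kept.append(lemma)
    # Combine nearby words into n-grams (initial n = 2).
    return [' '.join(gram) for gram in ngrams(kept, n)]
```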
Since P(w|C) is a very small number, we store probabilities using the Decimal type. We also used multiprocessing to accelerate both the dictionary-generation and validation steps. If a word does not exist in either of the two dictionaries, we use 1 divided by the total number of words in the dictionary as its conditional probability, which is a very tiny number.
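A minimal sketch of the scoring step, assuming dictionaries like those built above; the function names and the multiprocessing usage shown in the comments are illustrative:

```python
from decimal import Decimal
from functools import partial
from multiprocessing import Pool

def score(tokens, word_probs, prior=Decimal('0.5')):
    """Multiply the prior by P(w|C) for every token; unseen words fall
    back to 1 / (total number of words in the dictionary)."""
    fallback = Decimal(1) / Decimal(len(word_probs))
    p = prior
    for w in tokens:
        p *= Decimal(str(word_probs[w])) if w in word_probs else fallback
    return p

def classify(tokens, pos_dict, neg_dict):
    return 'positive' if score(tokens, pos_dict) > score(tokens, neg_dict) else 'negative'

# Validation can be parallelised with multiprocessing.Pool and functools.partial:
# with Pool() as pool:
#     labels = pool.map(partial(classify, pos_dict=pos_dict, neg_dict=neg_dict),
#                       tokenized_reviews)
```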
A dictionary without any filtering would be huge and full of noisy words. Thus, we only keep words whose types appear in the included list and that do not fall into the excluded type list, the excluded word list, or the stop-word list.
Effectiveness of Classifier
To assess the effectiveness of the model, we determined the accuracy on positive and negative reviews using the provided test set. Since there are over 12,000 samples in each of the positive and negative test sets, we randomly selected examples to validate our model. Taking an increasing number of samples in the validation step gives a good estimate of the model's true accuracy (a sketch of the validation loop follows Table 2-1). Using 5,000 validation samples, the accuracy of correctly predicting a positive review was 0.7088, while that for a negative review was 0.9422. The results are shown in Table 2-1.
Validation Samples | Positive Accuracy | Negative Accuracy |
---|---|---|
50 | 0.76 | 0.90 |
100 | 0.71 | 0.93 |
500 | 0.736 | 0.942 |
1000 | 0.742 | 0.955 |
2000 | 0.722 | 0.9475 |
5000 | 0.7088 | 0.9422 |
Table 2-1. Positive and negative accuracy for different numbers of validation samples for 2-gram model.
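A rough sketch of this validation loop, reusing the hypothetical preprocess and classify helpers from the earlier sketches:

```python
import random

def class_accuracy(test_reviews, true_label, n_samples, pos_dict, neg_dict):
    # Randomly sample reviews from one test set and measure per-class accuracy.
    sample = random.sample(test_reviews, n_samples)
    correct = sum(classify(preprocess(r), pos_dict, neg_dict) == true_label
                  for r in sample)
    return correct / n_samples

# positive_acc = class_accuracy(pos_test_reviews, 'positive', 5000, pos_dict, neg_dict)
# negative_acc = class_accuracy(neg_test_reviews, 'negative', 5000, pos_dict, neg_dict)
```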
Since the above model used n-grams of 2, the next model we tried used n-grams of 3. The same relative pattern emerged, with the model more accurately classifying negative reviews. However, at 5,000 validation samples the positive accuracy is slightly lower for the 3-gram model than for the 2-gram model, while the negative accuracy is slightly higher. The results are shown in Table 2-2.
Validation Samples | Positive Accuracy | Negative Accuracy |
---|---|---|
50 | 0.70 | 0.92 |
100 | 0.70 | 0.93 |
500 | 0.702 | 0.956 |
1000 | 0.702 | 0.965 |
2000 | 0.681 | 0.9575 |
5000 | 0.6674 | 0.9536 |
Table 2-2. Positive and negative accuracy for different numbers of validation samples for 3-gram model.
To get a deeper understanding of how changing the value of n in the n-gram function affects the model, n was set to 10 for the following tests. The positive accuracy is much lower for the 10-gram model than for the 2- and 3-gram models, while the negative accuracy is slightly higher. The results are shown in Table 2-3.
Validation Samples | Positive Accuracy | Negative Accuracy |
---|---|---|
50 | 0.44 | 1.0 |
100 | 0.42 | 1.0 |
500 | 0.44 | 0.99 |
1000 | 0.453 | 0.992 |
2000 | 0.429 | 0.9895 |
5000 | 0.4144 | |
Table 2-3. Positive and negative accuracy for different numbers of validation samples for 10-gram model.
All of the above runs were made without excluding first and third person present tense verbs via the nltk package. We decided to remove those verbs and rerun the model; the following analysis excludes them. The different levels of n-grams were tested with 5,000 validation samples. While the negative accuracy decreased slightly, the positive accuracy increased. The greatest gain for the positive class came from the 10-gram model, whose positive accuracy rose from 41.4% to 51.4% once those verbs were excluded. Although this is a substantial increase, 51% accuracy is still not great. The decrease in negative accuracy is not much of a concern, as it remains over 90% for each of the three models. The results can be seen in Table 2-4, with a sketch of the verb-exclusion filter below it.
n-gram | Positive Accuracy | Negative Accuracy |
---|---|---|
2 | 0.7624 | 0.9186 |
3 | 0.7132 | 0.9314 |
10 | 0.5148 | 0.9786 |
Table 2-4. Positive and negative accuracy for different n-gram models without first and third person present tense verbs.
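For reference, a small sketch of the extra filter used for these runs, assuming that "first and third person present tense verbs" correspond to the Penn Treebank tags VBP and VBZ produced by nltk.pos_tag:

```python
import nltk

PRESENT_TENSE_VERB_TAGS = {'VBP', 'VBZ'}  # non-3rd-person and 3rd-person singular present

def drop_present_tense_verbs(tokens):
    # Remove present tense verbs before building the dictionaries.
    return [word for word, tag in nltk.pos_tag(tokens)
            if tag not in PRESENT_TENSE_VERB_TAGS]
```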
Results
As mentioned above, the model was much better at classifying negative reviews than positive ones. Even without looking at the actual reviews, this suggests that some words occur much more often in the negative reviews than in the positive ones, making the negative class easier to identify.
In the specific case of this data, IMDB may want to highlight a select few of the positive and negative reviews for quick reference when a user views the page for a particular movie. This model would do well at surfacing negative reviews, but not all of the reviews it selects as positive will actually be positive.
While we were content with the accuracy of the classification of the negative reviews, we would have liked to increase the positive classification accuracy. Attempting to do so requires creating new dictionaries, which is very time consuming, so we only created three sets of dictionaries before settling on the third set. Given more time, we would create more dictionaries and try to limit the number of words common to both. This would, in theory, leave a greater number of unique words in the positive and negative dictionaries. Another avenue worth investigating is removing more, or different, types of words, either in addition to or in place of the first and third person present tense verbs.