Sentiment analysis of tweets using the Sentiment140 dataset from Kaggle: https://www.kaggle.com/kazanova/sentiment140
File structure required for the project:
project/
| input/
| | dataset.csv
| controller.py
| evaluate.py
| predict.py
| read_data.py
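For orientation, here is a minimal sketch of how input/dataset.csv could be loaded. It is not the code in read_data.py: the function names `load_tweets` and `parse_rows` are hypothetical, and it assumes the file keeps the published Sentiment140 layout (no header row, six columns, polarity 0 for negative and 4 for positive) and is read as Latin-1.

```python
import csv

# Column order as published with Sentiment140; the file has no header row.
COLUMNS = ["target", "id", "date", "flag", "user", "text"]

# Sentiment140 encodes polarity as 0 (negative) and 4 (positive).
POLARITY = {"0": 0, "4": 1}


def parse_rows(rows):
    """Turn raw CSV rows into (texts, labels), with 0 = negative, 1 = positive."""
    texts, labels = [], []
    for row in rows:
        record = dict(zip(COLUMNS, row))
        label = POLARITY.get(record.get("target"))
        if label is None:
            # Skip rows with an unexpected polarity value.
            continue
        texts.append(record["text"])
        labels.append(label)
    return texts, labels


def load_tweets(path="input/dataset.csv"):
    """Hypothetical loader; the encoding is an assumption about the file."""
    with open(path, newline="", encoding="latin-1") as handle:
        return parse_rows(csv.reader(handle))
```

Splitting the parsing from the file handling means `parse_rows` can be tried on a couple of hand-written rows without loading the full dataset.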
To run the code:
python3 controller.py
If you want to modify what parts of the code are run, edit controller.py
This project was created for the cps803 class at Ryerson University. In it I use three models (LinearSVC, BernoulliNB and LogisticRegression) to categorise tweets into two sentiment categories. I explore the different ways you can prepare data for these models and how well each of these preparations performs.
Warning: it takes a long time to run, so make sure you start it before making dinner or something.
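As a rough illustration of the idea, the sketch below fits the three scikit-learn models on one possible data preparation. It is not the project's actual code: the TF-IDF preparation, the `fit_and_score` helper and the toy tweets are all assumptions made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy tweets invented for illustration; 1 = positive, 0 = negative.
train_texts = [
    "i love this so much",
    "what a great day",
    "this is awful",
    "i hate waiting",
    "best concert ever",
    "worst service ever",
]
train_labels = [1, 1, 0, 0, 1, 0]
test_texts = ["i love this day", "this is the worst"]
test_labels = [1, 0]


def fit_and_score(train_texts, train_labels, test_texts, test_labels):
    """Fit each model on TF-IDF features and return its test accuracy."""
    models = {
        "LinearSVC": LinearSVC(),
        "BernoulliNB": BernoulliNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        # TF-IDF is an assumed preparation; swap the vectorizer to compare others.
        pipeline = make_pipeline(TfidfVectorizer(), model)
        pipeline.fit(train_texts, train_labels)
        scores[name] = pipeline.score(test_texts, test_labels)
    return scores


scores = fit_and_score(train_texts, train_labels, test_texts, test_labels)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```

Keeping the vectorizer and the model in one pipeline makes it easy to compare preparations, since only the first step changes between runs.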
The negative word cloud generated by the preprocessed data:
I wasn't kidding about how long it takes. This is only the time for training the models. It doesn't take into account any of the preprocessing.
Presented here are the results of the models on all data preparations. We see that the models do not perform well with word vector data.
The confusion matrix for the best performing model.
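For readers unfamiliar with confusion matrices, here is a small sketch of how one can be computed with scikit-learn. The labels and predictions are made up for the example and are not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; 0 = negative, 1 = positive.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"accuracy: {(tn + tp) / matrix.sum():.2f}")
```

The diagonal holds the correctly classified tweets, so a good model puts most of its counts there.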