This project trains and evaluates various classifiers on news articles collected from NU.nl, with the goal of predicting each article's popularity, measured as the number of comments it receives.
crawling
contains a script to collect articles from the news site, save them to a database,
and update them with the number of comments they have received.
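
The save-then-update flow could look roughly like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database. The table layout, the column names, the example URL, and the helpers `create_schema`, `save_article` and `update_comment_count` are all hypothetical and not taken from the actual script; fetching pages from the site is left out.

```python
import sqlite3

def create_schema(conn):
    # Hypothetical table layout; the real database may look different.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               url TEXT PRIMARY KEY,
               title TEXT NOT NULL,
               body TEXT NOT NULL,
               comment_count INTEGER
           )"""
    )

def save_article(conn, url, title, body):
    # INSERT OR IGNORE keeps the first stored copy when a URL is crawled again.
    conn.execute(
        "INSERT OR IGNORE INTO articles (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()

def update_comment_count(conn, url, comment_count):
    # Overwrite the stored count with the latest observed value.
    cur = conn.execute(
        "UPDATE articles SET comment_count = ? WHERE url = ?",
        (comment_count, url),
    )
    conn.commit()
    return cur.rowcount  # number of rows updated (0 if the URL is unknown)

conn = sqlite3.connect(":memory:")
create_schema(conn)
save_article(conn, "https://example.org/a1", "Example title", "Example body text.")
update_comment_count(conn, "https://example.org/a1", 42)
row = conn.execute("SELECT comment_count FROM articles").fetchone()
```

Keeping the comment count in a nullable column lets an article be stored as soon as it is found and filled in later, once comments have had time to accumulate.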
preprocessing
contains a script for preprocessing all text in the collected articles.
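
As an illustration of what text preprocessing for this kind of task might involve, here is a minimal sketch using only the standard library. The specific steps (Unicode normalization, lowercasing, keeping only letter tokens, removing stop words), the tiny Dutch stop word list, and the function name `preprocess` are assumptions for the example; the actual script may do something different.

```python
import re
import unicodedata

# Tiny illustrative Dutch stop word list; a real list would be much longer.
STOP_WORDS = {"de", "het", "een", "en", "van", "in", "op"}

def preprocess(text):
    # Normalize Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFC", text)
    # Lowercase and keep only runs of letters (drops digits and punctuation).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Remove very common words that carry little information.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("De minister reageert op het nieuws van vandaag!")
```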
learning
contains scripts to transform the collected data into input for the classifiers,
and a script to train and evaluate classifiers on the data.
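
The transform, train and evaluate steps might be sketched as follows, assuming scikit-learn is available. The toy articles, the comment-count thresholds in `to_class`, and the choice of a plain bag-of-words representation are all made up for illustration; the project's own scripts may bin the counts and build features differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def to_class(comment_count):
    # Hypothetical binning of comment counts into popularity classes.
    if comment_count < 10:
        return "low"
    if comment_count < 100:
        return "medium"
    return "high"

# Made-up toy data standing in for (preprocessed text, comment count) pairs.
articles = [
    ("gemeente opent nieuwe fietsenstalling", 2),
    ("bibliotheek verlengt openingstijden", 0),
    ("lokale markt verhuist naar plein", 5),
    ("museum toont oude kaarten", 3),
    ("voetbalclub wint thuiswedstrijd", 40),
    ("treinverkeer ontregeld door storing", 65),
    ("nieuwe telefoon gepresenteerd", 30),
    ("zanger kondigt tournee aan", 55),
    ("kabinet presenteert nieuwe begroting", 250),
    ("minister treedt af na debat", 480),
    ("belasting omhoog volgend jaar", 320),
    ("verkiezingen uitslag verrast partijen", 610),
]
texts = [text for text, _ in articles]
labels = [to_class(count) for _, count in articles]

# Hold out a quarter of the articles for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

scores = {}
for name, classifier in [("MultinomialNB", MultinomialNB()),
                         ("LinearSVC", LinearSVC())]:
    # CountVectorizer turns each text into a vector of word counts.
    model = make_pipeline(CountVectorizer(), classifier)
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```

With so few examples the resulting accuracies are not meaningful; the sketch only shows the shape of the pipeline.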
Currently, when trained on a thousand articles, the multinomial Naive Bayes classifier classifies 50% of the articles correctly, while the linear Support Vector Machine scores around 48%.
Ideas for improving classification performance include:
- Collecting more data
- Applying feature selection
- Investigating the effects of training the classifiers with different parameters
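
The last two ideas above could be explored together with a small grid search, sketched here under the assumption that scikit-learn is used. The chi-squared feature selection, the parameter values in the grid, and the toy data are illustrative choices, not results or settings from this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data: preprocessed texts with a popularity class each.
texts = [
    "gemeente opent nieuwe fietsenstalling",
    "bibliotheek verlengt openingstijden vandaag",
    "lokale markt verhuist naar plein",
    "museum toont oude kaarten",
    "kabinet presenteert nieuwe begroting",
    "minister treedt af na debat",
    "belasting omhoog volgend jaar",
    "verkiezingen uitslag verrast partijen",
]
labels = ["low", "low", "low", "low", "high", "high", "high", "high"]

pipeline = Pipeline([
    ("vectorize", CountVectorizer()),   # texts -> word count vectors
    ("select", SelectKBest(chi2)),      # keep the k highest-scoring words
    ("classify", MultinomialNB()),
])

# Step names are prefixed to parameter names with a double underscore.
param_grid = {
    "select__k": [5, "all"],
    "classify__alpha": [0.1, 1.0],
}

# Cross-validate every parameter combination and keep the best one.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
best = search.best_params_
```

Putting feature selection inside the pipeline means it is refit on each training fold, so the held-out fold does not leak into the choice of features.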
See the wiki.