Home
Selman Ercan edited this page Jun 10, 2016 · 16 revisions
Welcome to the NewsClassification wiki!
NewsClassification is a machine learning project in Python for predicting the popularity of news articles, measured by the number of comments an article receives.
The news site used is nu.nl, the most visited news site in the Netherlands.
Scraping is done with the lxml library, text preprocessing with NLTK, and machine learning with scikit-learn.
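As a rough illustration of the scraping side, here is a minimal sketch that parses an article page with lxml. The sample markup, the class names `title` and `article-body`, and the helper `parse_article` are all invented for this example; they are not the real nu.nl page structure, so the XPath expressions would need to be adapted to the actual site.

```python
from lxml import html

# Hypothetical markup: NOT the real nu.nl structure, only a stand-in.
SAMPLE_PAGE = """
<html><body>
  <h1 class="title">Voorbeeldkop</h1>
  <div class="article-body"><p>Eerste alinea.</p><p>Tweede alinea.</p></div>
</body></html>
"""


def parse_article(page_source):
    """Extract the title and body text from one article page (hypothetical helper)."""
    tree = html.fromstring(page_source)
    # string(...) returns the text of the first matching node, or "" if none.
    title = tree.xpath('string(//h1[@class="title"])').strip()
    paragraphs = [
        p.text_content().strip()
        for p in tree.xpath('//div[@class="article-body"]/p')
    ]
    return {"title": title, "text": " ".join(paragraphs)}


article = parse_article(SAMPLE_PAGE)
print(article["title"])
print(article["text"])
```

In practice the page source would come from an HTTP request rather than a string constant; that part is left out here to keep the sketch self-contained.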
The project consists of three main parts:
- Collecting
  - Scrape news articles using lxml
  - For scraped articles older than a day, get the number of comments they have received by then
- Preprocessing
  - Convert all text to lowercase
  - Remove punctuation marks
  - Remove stopwords using NLTK
- Learning
  - Document classification
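The three preprocessing steps listed above could look roughly like the sketch below. The function names `load_dutch_stopwords` and `preprocess` are made up for this example, and the small fallback stopword set is an assumption used only when NLTK or its stopwords corpus is not installed; the project's actual code may differ.

```python
import string

# Tiny fallback list (an assumption for this sketch), used only when the
# NLTK stopwords corpus is unavailable.
FALLBACK_STOPWORDS = {"de", "het", "een", "is", "en", "van", "in", "op"}


def load_dutch_stopwords():
    """Return Dutch stopwords from NLTK, or the fallback set if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("dutch"))
    except (ImportError, LookupError):
        # LookupError is raised when the corpus has not been downloaded
        # (it can be fetched with nltk.download("stopwords")).
        return set(FALLBACK_STOPWORDS)


def preprocess(text, stop_words):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in stop_words]


tokens = preprocess("De minister zegt: het is een ramp!", load_dutch_stopwords())
print(tokens)
```

Note that `string.punctuation` covers ASCII punctuation only; typographic quotes or dashes in article text would need extra handling.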
See Results for an overview of the results achieved so far.
Currently, the multinomial Naive Bayes classifier can classify slightly more than 50% of the articles correctly.
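For readers unfamiliar with the approach, a multinomial Naive Bayes document classifier can be sketched with scikit-learn as follows. Everything specific here is an assumption made for illustration: the `comment_class` helper, its thresholds of 10 and 100 comments, the class names, and the toy training data are all invented and do not reflect the project's real data or its results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def comment_class(n_comments, thresholds=(10, 100)):
    """Map a comment count to a popularity class (hypothetical thresholds)."""
    low, high = thresholds
    if n_comments < low:
        return "low"
    if n_comments < high:
        return "medium"
    return "high"


# Invented toy data: already-preprocessed article texts and comment counts.
train_texts = [
    "minister kabinet debat begroting",
    "kabinet valt premier treedt af",
    "voetbal ajax wint wedstrijd",
    "nieuwe fietsenstalling geopend station",
]
train_counts = [250, 480, 35, 2]
train_labels = [comment_class(n) for n in train_counts]

# Bag-of-words counts feed the multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

prediction = model.predict(["premier kabinet debat"])[0]
print(prediction)
```

Framing the task as classification over comment-count bins is one possible reading of "document classification"; how the bins are chosen affects how an accuracy figure should be interpreted.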