Home

Welcome to the NewsClassification wiki!

Project overview

NewsClassification is a machine learning project in Python for predicting the popularity of news articles, measured by the number of comments an article receives.
The news site used is nu.nl, the most visited news site in the Netherlands.
Scraping is done with the lxml library, text preprocessing with NLTK and machine learning with scikit-learn.

Details

The project consists of three main parts:

Collecting
  - Scrape news articles using lxml
  - For scraped articles older than a day, get the number of comments they have received by then
Preprocessing
  - Convert all text to lowercase
  - Remove punctuation marks
  - Remove stopwords using NLTK
Learning
  - Document classification
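
The three preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `preprocess` and `load_dutch_stopwords` are hypothetical, and the use of NLTK's Dutch stopword list is an assumption based on nu.nl being a Dutch-language site.

```python
import string


def load_dutch_stopwords():
    """Return NLTK's Dutch stopword list as a set (hypothetical helper).

    Assumes NLTK is installed and nltk.download("stopwords") has been run once.
    """
    from nltk.corpus import stopwords
    return set(stopwords.words("dutch"))


def preprocess(text, stopwords):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    # string.punctuation covers ASCII punctuation only; typographic quotes
    # and similar characters would need extra handling.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punct.split() if token not in stopwords]
```

With the NLTK data available, a call would look like `preprocess(article_text, load_dutch_stopwords())`, where `article_text` stands for the text of one scraped article. Passing the stopwords in as an argument keeps the function usable with any other word list.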

Results

See Results for an overview of the results achieved so far.
Currently, the multinomial Naive Bayes classifier classifies slightly more than 50% of the articles correctly.
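
For readers unfamiliar with the setup, a multinomial Naive Bayes text classifier in scikit-learn might look roughly like the sketch below. The training texts and the class labels `"few"` and `"many"` are made up for illustration; this page does not state how comment counts are turned into classes, so that binning is an assumption, as is the use of `CountVectorizer` for feature extraction.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data: preprocessed article text and an assumed popularity class.
train_texts = [
    "voetbal ajax wint wedstrijd",
    "ajax verliest voetbal finale",
    "kabinet stemt over begroting",
    "begroting kabinet tweede kamer",
]
train_labels = ["many", "many", "few", "few"]

# Bag-of-words counts feed directly into the multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Predict the class of a new, unseen article.
prediction = model.predict(["ajax voetbal"])[0]
```

Measuring a figure comparable to the one quoted above would require a held-out set of real articles, for example via `sklearn.model_selection.train_test_split` followed by `model.score`.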
