Files for my solution to this Kaggle competition
The task description can be found here:
In this competition, we challenge you to develop an analytics model that will help the New York Times understand the features of a blog post that make it popular.
The main script. It calls the other scripts in order to generate a prediction.
Loads data into a dataframe.
Different scripts that generate a corpus from text fields and add them as predictors.
This process includes fitting linear models to identify statistically significant terms, which are then used for variable selection.
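As a rough illustration of turning text fields into predictors, the sketch below builds one 0/1 indicator column per term. It is not the code in this repository: it is plain Python, the names `tokenize` and `term_predictors` are hypothetical, the headlines are made up, and it keeps terms by a simple document-frequency threshold (`min_docs`) rather than by the linear-model significance test described above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_predictors(texts, min_docs=2):
    """Return (vocabulary, rows) with one 0/1 indicator per kept term per text.

    A term is kept only if it appears in at least `min_docs` texts. This is a
    stand-in for the significance-based selection, not an equivalent of it.
    """
    tokenized = [set(tokenize(t)) for t in texts]
    doc_freq = Counter(term for doc in tokenized for term in doc)
    vocabulary = sorted(t for t, n in doc_freq.items() if n >= min_docs)
    rows = [[int(term in doc) for term in vocabulary] for doc in tokenized]
    return vocabulary, rows

# Made-up headlines for illustration, not competition data.
headlines = [
    "Obama Visits New York",
    "New York Fashion Week Opens",
    "Obama Signs Budget",
]
vocabulary, rows = term_predictors(headlines)
```

Each row can then be appended to the other predictors of the corresponding post before training.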
Splits the training data into training and test sets. TODO: do cross-validation.
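A minimal sketch of both the current hold-out split and the cross-validation mentioned in the TODO, in plain Python. The function names, the 30% test fraction, and the fixed seed are assumptions made for illustration, not values taken from the scripts.

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, held_out

train, test = holdout_split(list(range(10)))
folds = list(k_fold_indices(10, k=5))
```

With cross-validation, each row is held out exactly once, so the error estimate depends less on a single lucky or unlucky split.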
Trains a Random Forest and makes predictions.
Contains different predictions as CSV files.
- Use ensemble methods.
- Do cross-validation for better parameter selection.
- Impute missing values for section/subsection. Right now empty values are taken as a factor level and are therefore treated as meaningful for predictions.
- Do more exploratory data analysis, especially for misclassified cases (e.g. use word clouds).
- Filter out frequent terms that carry no meaning in the corpus (e.g. new, york, etc.).
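
Two of the items above (imputing missing sections and filtering frequent terms) could start from something like the following sketch. It is plain Python with hypothetical names. The stopword set contains only the two examples given in the list, and filling with the most frequent value is just one simple imputation choice among several.

```python
from collections import Counter

# Assumed starting list; extend it after inspecting the corpus.
CUSTOM_STOPWORDS = {"new", "york"}

def drop_stopwords(tokens, stopwords=CUSTOM_STOPWORDS):
    """Remove tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

def impute_mode(values, missing=""):
    """Replace missing entries with the most frequent non-missing value."""
    observed = [v for v in values if v != missing]
    if not observed:
        return list(values)  # nothing to impute from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

kept = drop_stopwords(["New", "York", "fashion"])
sections = impute_mode(["Business", "", "Business", "Arts"])
```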