Kaggle 15.071x - The Analytics Edge (Spring 2015)

Files for my solution to this Kaggle competition.

Task description

The task description can be found here:

In this competition, we challenge you to develop an analytics model that will help the New York Times understand the features of a blog post that make it popular.

Files description

`main.R`

The main script. It calls the other scripts to generate a prediction.

`loader.R`

Loads data into a dataframe.
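
As a rough illustration of this loading step, the sketch below reads a CSV file into a data frame while keeping text fields as character vectors. The function name `load_posts`, the columns `Headline` and `Popular`, and the temporary file are assumptions made for the example, not the contents of `loader.R`.

```r
# Minimal sketch: read a CSV into a data frame, keeping text as character
# (not factor) so it can later be turned into a corpus.
# `load_posts` and the column names are hypothetical.
load_posts <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

# Toy file standing in for the competition data.
path <- tempfile(fileext = ".csv")
writeLines(c("Headline,Popular",
             "\"A quiet day\",0",
             "\"Big news, explained\",1"), path)

posts <- load_posts(path)
str(posts)
```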

`add_corpus_XXX.R`

Scripts that build a corpus from the text fields (abstract, headline, snippet, or all of them) and add the resulting terms as predictors.

This process includes fitting linear models to identify significant terms, which are then used for variable selection.
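
The idea of turning text into term predictors and keeping only the significant ones can be sketched in base R as follows. This is not the code from the `add_corpus_XXX.R` scripts: the helper names `term_matrix` and `significant_terms`, the 0.05 threshold, and the simulated headlines are all assumptions made for illustration, and a text mining package would normally add steps such as stop-word removal and stemming.

```r
# Build a binary document-term matrix: one row per document, one column per term.
# `term_matrix` is a hypothetical helper, written in base R only.
term_matrix <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")
  vocab <- sort(unique(unlist(tokens)))
  vocab <- vocab[nzchar(vocab)]
  dtm <- t(vapply(tokens, function(tok) as.numeric(vocab %in% tok),
                  numeric(length(vocab))))
  colnames(dtm) <- vocab
  dtm
}

# Fit a linear model of the outcome on all terms and keep the terms whose
# coefficient p-value is below `alpha` (an assumed threshold).
significant_terms <- function(dtm, y, alpha = 0.05) {
  fit <- lm(y ~ ., data = data.frame(y = y, dtm))
  coefs <- summary(fit)$coefficients[-1, , drop = FALSE]  # drop the intercept
  rownames(coefs)[coefs[, "Pr(>|t|)"] < alpha]
}

# Simulated data: only the word "obama" actually drives popularity.
set.seed(1)
vocab <- c("obama", "fashion", "recipe", "report", "york")
n <- 200
has_word <- matrix(runif(n * length(vocab)) < 0.4, nrow = n,
                   dimnames = list(NULL, vocab))
headline <- apply(has_word, 1, function(row) paste(vocab[row], collapse = " "))
popular <- rbinom(n, 1, ifelse(has_word[, "obama"], 0.8, 0.2))

dtm <- term_matrix(headline)
selected <- significant_terms(dtm, popular)
print(selected)
```

The selected columns of `dtm` could then be bound to the other predictors before training.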

`split_eval.R`

Splits the training data into training and test sets. TODO: do cross-validation.
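
A simple hold-out split of this kind might look like the sketch below. The function name `split_train_test`, the 70/30 ratio, the seed, and the toy columns are assumptions for the example and are not taken from `split_eval.R`.

```r
# Randomly split a data frame into training and test sets.
# `train_fraction` and `seed` are assumed defaults, chosen for illustration.
split_train_test <- function(df, train_fraction = 0.7, seed = 42) {
  set.seed(seed)
  train_idx <- sample(nrow(df), size = floor(train_fraction * nrow(df)))
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Toy data with hypothetical column names.
posts <- data.frame(
  WordCount = c(120, 800, 450, 300, 950, 60, 700, 510, 230, 640),
  Popular   = c(0, 1, 0, 0, 1, 0, 1, 1, 0, 1)
)

parts <- split_train_test(posts)
nrow(parts$train)
nrow(parts$test)
```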

`train_random_forest.R`

Trains a Random Forest and makes predictions.
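
For orientation, training a random forest and producing predicted probabilities could be sketched as below. It assumes the `randomForest` package is installed and runs only if it is available. The helper `train_and_predict`, the predictors `WordCount` and `Weekday`, the submission columns `Id` and `Probability1`, and the simulated data are hypothetical, not copied from `train_random_forest.R`.

```r
# Fit a random forest on a factor outcome and return P(Popular == 1) for new data.
# `train_and_predict` is a hypothetical helper; ntree = 200 is an arbitrary choice.
train_and_predict <- function(train, test, ntree = 200) {
  model <- randomForest::randomForest(Popular ~ ., data = train, ntree = ntree)
  predict(model, newdata = test, type = "prob")[, "1"]
}

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Simulated data with assumed column names.
  set.seed(1)
  n <- 200
  train <- data.frame(
    WordCount = rpois(n, 500),
    Weekday   = factor(sample(0:6, n, replace = TRUE))
  )
  train$Popular <- factor(rbinom(n, 1, plogis((train$WordCount - 500) / 50)))
  test <- train[1:20, c("WordCount", "Weekday")]

  prob_popular <- train_and_predict(train, test)
  submission <- data.frame(Id = seq_along(prob_popular),
                           Probability1 = prob_popular)
  # write.csv(submission, "results/submission.csv", row.names = FALSE)
  head(submission)
}
```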

`results` folder

Contains different predictions as CSV files.

Future work

Use ensemble methods.
Do cross-validation for better parameter selection.
Impute missing values for section/subsection. Right now empty values are taken as a factor level and are therefore treated as meaningful for predictions.
Do more exploratory data analysis, especially for misclassified cases (e.g. use word clouds).
Filter out frequent but non-meaningful terms in the corpus (e.g. new, york).
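
The cross-validation item above could be approached along the lines of this base R sketch. It uses a logistic regression fitted with `glm` to stay dependency-free. The function name `cross_validate`, the choice of 5 folds, the 0.5 cut-off, and the simulated data are assumptions for illustration.

```r
# k-fold cross-validation: returns the mean held-out accuracy of a logistic model.
# Assumes the outcome (left-hand side of `formula`) is coded as 0/1.
cross_validate <- function(df, formula, k = 5, seed = 42) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(df)))  # random fold labels
  outcome <- all.vars(formula)[1]
  accuracy <- vapply(seq_len(k), function(i) {
    fit <- glm(formula, data = df[fold != i, ], family = binomial)
    prob <- predict(fit, newdata = df[fold == i, ], type = "response")
    mean((prob > 0.5) == (df[fold == i, outcome] == 1))
  }, numeric(1))
  mean(accuracy)
}

# Simulated data with a hypothetical predictor.
set.seed(1)
posts <- data.frame(WordCount = rnorm(300, mean = 500, sd = 200))
posts$Popular <- rbinom(300, 1, plogis((posts$WordCount - 500) / 100))

cv_accuracy <- cross_validate(posts, Popular ~ WordCount)
cv_accuracy
```

Calling the function once per candidate model or parameter setting and comparing the returned accuracies gives a basis for selection.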
