pml-project.Rmd

---
title: "Practical Machine Learning Project"
author: "Walter Jessen"
date: "February 18, 2017"
output: html_document
---

# Introduction 

## Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset). 

## Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment. 

## Goal

The goal of your project is to predict the manner in which they did the exercise. This is the "classe" variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases. 

# Data Processing

## Load necessary packages
```{r load_packages, messages=FALSE}
library(caret) # also loads lattice, ggplot2
library(gbm) # also loads survival, splines, parallel
library(plyr)
library(randomForest)
```

## Download and read raw data
```{r load_data}
trainURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(trainURL,destfile="pml-training.csv")
testURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(testURL,destfile="pml-testing.csv")
traindf<-read.csv("pml-training.csv",sep=",",header=TRUE,stringsAsFactors=FALSE,na.strings=c("NA",""))
testdf<-read.csv("pml-testing.csv",sep=",",header=TRUE,stringsAsFactors=FALSE,na.strings=c("NA",""))
```

## Examine and clean the data

Keep the test set separate, only perform training/testing on traindf. Examine the structure of traindf.
```{r examine_traindf, results=FALSE}
str(traindf)
```

Many columns contain NA values; remove them by apply'ing over columns summing the number that pass is.na (i.e. TRUE).
Only keep columns where there are zero NA values (nasumindex==0).
```{r remove_na_from_traindf}
nasumindex<-apply(traindf,2,function(x){sum(is.na(x))})
traindf<-traindf[,which(nasumindex==0)]
```

Also remove the X, user_name, timestamp and window columns.
```{r remove_timestamp_window_columns}
traindf<-traindf[,-(1:7)]
```

This removed 107 of the 160 variables. Now split traindf up into a training and test set for cross validation.
```{r split_traindf}
set.seed(6)
inTrain<-createDataPartition(y=traindf$classe,p=0.6,list=FALSE)
training<-traindf[inTrain,]
testing<-traindf[-inTrain,]
dim(training)
dim(testing)
```

# Model building

## Select and compare models

Since this is a classification problem, select the gradient boosting (gbm) and random forest (rf) algorithms. Boosting and random forest are some of the most used/accurate algorithms. Additionally, evaluate the k nearest neighbors (knn) algorithm because I've found that algorithm to be useful for classification previously. The kappa metric (i.e. measure of concordance) is selected as the comparison criteria. It is extremely important to use cross validation when running the random forest algorithm. Thus, to reduce the risk of overfitting, a 10-fold cross validation is used.
```{r train_models}
fitControl<-trainControl(method="cv",number=10)
# Gradient boosting algorithm  (boost with trees)
gbmFit<-train(classe~.,data=training,preProcess=c("center","scale"),method="gbm",metric="Kappa",trControl=fitControl,verbose=FALSE)
# Random forest algorithm (bootstrap aggregating)
rfFit<-train(classe~.,data=training,preProcess=c("center","scale"),method="rf",metric="Kappa",trControl=fitControl,verbose=FALSE)
# K nearest neighbors algorithm
knnFit<-train(classe~.,data=training,preProcess=c("center","scale"),method="knn",metric="Kappa",trControl=fitControl)
```

## Compare models

Compare models by resampling. Plot kappa values for each model in order to select the best model to use on the test set.
```{r visualize_resamples}
library(lattice)
resamps<-resamples(list(gbm=gbmFit,rf=rfFit,knn=knnFit))
summary(resamps)
bwplot(resamps,metric="Kappa",main="Kappa Values by Algorithm\nK Nearest Neighbors (knn), Gradient Boosting (gbm), Random Forest (rf)")
```

Based on the summary table and box-and-whisker plot above, the random forest algorithm performs better than gradient goosting or k nearest neighbors with a higher mean accuracy, higher mean kappa value, and tighter distribution of values. Select the random forest model for cross validation and prediction of the 20 different test cases.

## Cross validation

Perform cross validataion on the test set using the prediction model based on the random forest algorithm.
```{r cross_validation}
cvPred<-predict(rfFit,testing)
confusionMatrix(cvPred,testing$classe)
```

## Expected out-of-sample error

Calculate the expected out-of-sample (out-of-bag) error on the prediciton model based on the random forest algorithm.
```{r oob_estimate}
rfFit$finalModel
```

The expected out-of-sample (out-of-bag) error is 0.89%.

# Predict 20 different test cases

Use the prediction model based on the random forest algorithm to predict 20 different test cases (testdf).
```{r make_prediction}
results<-predict(rfFit,newdata=testdf)
print(as.data.frame(results))
```