---
title: "State Data"
author: "Terrel Shumway"
date: "04/30/2015"
output: html_document
---
This document presents answers for homework 2 part 4.

## Getting and Cleaning the Data
```{r}
baseurl = "https://courses.edx.org/c4x/MITx/15.071x_2/asset/"
# Download `local` from baseurl on first use, then read the cached copy.
getdata = function(local){
  if(!file.exists(local)){
    library(downloader)
    remote = paste0(baseurl,local)
    print(remote)
    download(remote,local)
  }
  read.csv(local)
}
data = getdata("statedata.csv")
```
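If the course URL is unreachable, an offline stand-in can probably be assembled from R's built-in `state` datasets. This is only a sketch under an assumption: the column names used in this document match those datasets, but nothing here confirms that `statedata.csv` was built this way. The name `statedata_builtin` is made up for illustration.

```r
# Assumption: statedata.csv mirrors R's built-in `state` datasets.
# data.frame() applies check.names, so "Life Exp" and "HS Grad"
# become Life.Exp and HS.Grad.
statedata_builtin <- data.frame(
  state.x77,
  state.abb      = state.abb,
  state.area     = state.area,
  x              = state.center$x,   # longitude of each state's center
  y              = state.center$y,   # latitude of each state's center
  state.division = state.division,
  state.name     = state.name,
  state.region   = state.region,
  row.names      = NULL              # plain integer row names, as read.csv would give
)
str(statedata_builtin)
```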
## Data exploration
problem 1.1:
```{r}
plot(data$x,data$y)
library(ggplot2)
ggplot(data,aes(x=x,y=y))+geom_point()
```
problem 1.2:
```{r}
sort(tapply(data$HS.Grad,data$state.region,max),decreasing=TRUE)
```
problem 1.3:
```{r}
ggplot(data,aes(x=state.region,y=Murder))+geom_boxplot()
```
problem 1.4:
```{r}
ne = subset(data,state.region=="Northeast")
ne[which.max(ne$Murder),c("state.name","Murder")]
# why doesn't this work?
#data[tapply(data$Murder,data$state.region,which.max),]
```
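A likely answer to the question in the chunk above: `tapply()` passes `which.max()` one region's values at a time, so each result is a position within that region's subset, not a row number in the full data frame. The sketch below shows one way around it, splitting the row numbers by region and indexing back. It uses R's built-in `state` datasets as a stand-in for the CSV (an assumption), and `d`, `within_group`, `rows` and `top_by_region` are illustrative names.

```r
# Stand-in data (assumption: same layout as the relevant statedata.csv columns)
d <- data.frame(state.name   = state.name,
                state.region = state.region,
                Murder       = state.x77[, "Murder"],
                row.names    = NULL)

# Positions *within each region*; using these to index `d` picks the wrong rows
within_group <- tapply(d$Murder, d$state.region, which.max)

# Split the row numbers of `d` by region, then pick the row number
# with the largest Murder value inside each group
rows <- sapply(split(seq_len(nrow(d)), d$state.region),
               function(i) i[which.max(d$Murder[i])])

top_by_region <- d[rows, ]
print(top_by_region)
```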
```{r}
# this works, but is equivalent to the first above, not an improvement
library(data.table)
dt = as.data.table(data)
setkey(dt,state.region,Murder)
dt["Northeast",state.name,mult="last"]
rm(dt)
```
problem 2.1,2.2:
```{r}
m1.form = Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area
m1 = lm(m1.form,data=data)
s = summary(m1)
s$coefficients["Income",]
```
problem 2.3:
```{r}
ggplot(data,aes(x=Income,y=Life.Exp))+geom_point()
```
problem 2.3:
```{r}
cor(data$Income,data$Life.Exp)
```
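The coefficient from problem 2.2 and the correlation above measure different things: the first is the effect of `Income` with the other predictors held fixed, the second ignores them. A small sketch that puts the two side by side, again assuming R's built-in `state.x77` matrix as a stand-in for the CSV (`sd77`, `full`, `simple` and `income_view` are illustrative names):

```r
sd77 <- data.frame(state.x77)   # check.names turns "Life Exp" into Life.Exp

full   <- lm(Life.Exp ~ ., data = sd77)        # all seven predictors
simple <- lm(Life.Exp ~ Income, data = sd77)   # Income alone

# The simple-regression slope always shares the sign of the correlation;
# the multiple-regression coefficient need not.
income_view <- c(
  multiple    = unname(coef(full)["Income"]),
  simple      = unname(coef(simple)["Income"]),
  correlation = cor(sd77$Income, sd77$Life.Exp)
)
print(income_view)
```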
## Predicting Life Expectancy
problem 3.1:
```{r}
m2 = step(m1)
# named s2 rather than t, so base R's t() (transpose) is not masked
s2 = summary(m2)
c(s$r.squared,s2$r.squared)
```
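With its default backward direction, `step()` only removes terms, so the reduced model can never have a higher plain R² than the full one; adjusted R² is often the fairer comparison. A sketch under the same stand-in assumption (built-in `state.x77` in place of the CSV; `sd77`, `full`, `reduced` and `fit_summary` are illustrative names):

```r
sd77    <- data.frame(state.x77)
full    <- lm(Life.Exp ~ ., data = sd77)
reduced <- step(full, trace = 0)   # trace = 0 silences the step-by-step log

# One row per model: plain R^2, adjusted R^2, and number of predictors kept
fit_summary <- rbind(
  full    = c(r.squared     = summary(full)$r.squared,
              adj.r.squared = summary(full)$adj.r.squared,
              n.terms       = length(coef(full)) - 1),
  reduced = c(r.squared     = summary(reduced)$r.squared,
              adj.r.squared = summary(reduced)$adj.r.squared,
              n.terms       = length(coef(reduced)) - 1)
)
print(fit_summary)
print(formula(reduced))   # which predictors survived
```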
## Analyzing the Predictions
problem 3.3:
```{r}
pf = data.frame(
  statecode = data$state.abb,
  predicted = predict(m2),
  actual = data$Life.Exp,
  residual = resid(m2)
)
pf[which.min(pf$predicted),]
pf[which.min(pf$actual),]
```
problem 3.4:
```{r}
pf[which.max(pf$predicted),]
pf[which.max(pf$actual),]
```
problem 3.5:
```{r}
pf[which.min(abs(pf$residual)),]
pf[which.max(abs(pf$residual)),]
```
## Bonus
```{r fig.width=10, fig.height=8}
# fig.width and fig.height in the chunk header set the plot size in inches; adjust to taste
ggplot(pf,aes(y=predicted,x=actual))+geom_smooth(method="lm")+geom_text(aes(label=statecode),size=4)
```