hw1p4-internet-privacy-poll.Rmd

---
title: "Internet Privacy Poll"
author: "Terrel Shumway"
date: "04/28/2015"
output: html_document
---

This document presents answers for homework 1 part 4.

## Loading and Summarizing the Dataset

```{r}
baseurl = "https://courses.edx.org/c4x/MITx/15.071x_2/asset/"

getdata = function(local){
  if(!file.exists(local)){
    library(downloader)
    remote = paste0(baseurl,local)
    #print(remote)
    download(remote,local)
  }
  data = read.csv(local)
  data
}

data = getdata("AnonymityPoll.csv")

t = c(table(data$Smartphone),na=sum(is.na(data$Smartphone)))

u = as.data.frame(table(data$Region,data$State))
mws = u[u$Var1=="Midwest"&u$Freq>0,]
mws = sort(as.character(mws[mws$Var1=="Midwest"&mws$Freq>0,"Var2"]))

south = u[u$Var1=="South"&u$Freq>0,]
bigsouth = south[which.max(south$Freq),]

```
problem     | answer
------------|--------
problem 1.1 | `r nrow(data)`
problem 1.2 | `r t[c(2,1,3)]`
problem 1.3a | `r mws`
problem 1.3b | `r bigsouth$Var2`

## Evaluating Missing Values
problem 2.1:
```{r}
table(data$Internet,data$Smartphone)
```

problem 2.2: 
internet: `r sum(is.na(data$Internet.Use))`
smartphone: `r sum(is.na(data$Smartphone))`


```{r}
cond = (data$Internet.Use==1|data$Smartphone==1)
limited = subset(data,Internet.Use==1|Smartphone==1)
nrow(data[cond,])
nrow(data[cond&!is.na(cond),])
```
Note the difference between the `[` operator and the `subset` function.

problem 2.3: `r nrow(limited)`

## Summarizing Opinions About Internet Privacy

problem 3.1:
```{r}
n = sapply(colnames(limited),function(x){sum(is.na(limited[,x]))})
names(n)[n>0]
```

problem 3.2: `r mean(limited$Info.On.Internet)`


problem 3.3:
```{r}
table(limited$Info.On.Internet)
library(ggplot2)
ggplot(limited,aes(x=Info.On.Internet)) + 
  geom_histogram(binwidth=1,color="black",fill="orange")
# TODO: add text labels above bar
```

problem 3.4,3.5,3.6,3.7:
```{r}
sapply(c(
  "Worry.About.Info",
  "Anonymity.Possible",
  "Tried.Masking.Identity",
  "Privacy.Laws.Effective"
  ),function(x,d){mean(d[,x],na.rm=TRUE)},
  limited)
```

## Relating Demographics to Polling Results

problem 4.1:
```{r}
library(ggplot2)
ggplot(limited,aes(x=Age)) + geom_histogram(binwidth=10,color="black",fill="orange")
```

problem 4.2:
```{r}
max(table(limited$Age,limited$Info.On.Internet))
```

problem 4.3:
```{r}
jitter(1:10)
```

```{r}
ggplot(limited,aes(x=Age,y=Info.On.Internet)) + geom_point(position = position_jitter(w=0.2,h=0.3))
```


problem 4.4:

This heat map would be more interesting if `Info.On.Internet` were a set of belief indicators, e.g. `Name.On.Internet`,`DOB.On.Internet`, etc. instead of a count.

```{r}
binyears = 10
hm = as.data.frame(as.matrix(table(as.integer(limited$Age/binyears),limited$Info.On.Internet)))


levels(hm$Var1)=paste(tt<-seq(10,100,binyears),tt+binyears,sep="-")

ggplot(hm, aes(x=Var1,y=Var2)) + geom_tile(aes(fill=Freq)) +  
  scale_fill_gradient(low="white",high="blue") +
  labs(y="Info on Internet",x="Age Group",fill="Count")

```

problem 4.5:
```{r}
tapply(limited$Info.On.Internet,limited$Smartphone,mean)
```


problem 4.6:
```{r}
tapply(limited$Tried.Masking.Identity,limited$Smartphone,mean,na.rm=TRUE)
```