forked from gge-ucd/R-DAVIS
-
Notifications
You must be signed in to change notification settings - Fork 0
/
final_draft.Rmd
109 lines (82 loc) · 5.47 KB
/
final_draft.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
Alright folks, it's time for the final assignment of the quarter. The goal here is to generate an script that combines the skills we've learned throughout the quarter to produce several outputs.
<br>
#### The Data
For this project you are going to be using some data sets about flights departing New York City in 2013. There are **several** CSV files you will need to use (as with any CSVs you're handed, they are likely imperfect and incomplete).
You should download the [flights](data/nyc_13_flights_small.csv), [planes](data/nyc_13_planes.csv), and [weather](data/nyc_13_weather.csv) CSV files. (Remember to put them into your data folder of your RProject to make reading them in easier!)
Hint: You may have to combine dataframes to answer some questions. Remember our `join` family of functions? You should be able to use the `join` type we covered in class. The `flights` dataset is the biggest one, so you should probably join the other data onto this one, meaning `flights` would be the first (of "left") argument in the left join. You can't join 3 tables together at once, but you can join tables `a` and `b` to make table `ab`, then join `ab` and `c` to get table `abc` which contains the columns from all 3 original tables.
#### Things to Include
1. Plot the departure delay of flights against the precipitation, and include a simple regression line as part of the plot. Hint: there is a `geom_` that will plot a simple `y ~ x` regression line for you, but you might have to use an argument to make sure it's a regular **l**inear **m**odel. Use `ggsave` to save your ggplot objects into a **new folder** you create called "plots".
2. Create a figure that has date on the x axis and each day's mean departure delay on the y axis. Plot only months September through December. Somehow distinguish between airline carriers (the method is up to you). Again, save your final product into the "plot" folder.
3. Create a dataframe with these columns: date (year, month and day), mean_temp, where each row represents the airport, based on airport code. Save this is a new csv into you `data` folder called `mean_temp_by_origin.csv`.
4. Make a function that can: (1) convert hours to minutes; and (2) convert minutes to hours (i.e., it's going to require some sort of conditional setting in the function that determines which direction the conversion is going). Use this function to convert departure delay (currently in minutes) to hours and then generate a boxplot of departure delay times by carrier. Save this function into a script called "customFunctions.R" in your scripts/code folder.
5. Below is the plot we generated from the new data in Q4. (Base code:
`ggplot(df, aes(x = dep_delay_hrs, y = carrier, fill = carrier)) +
geom_boxplot()`). The goal is to visualize delays by carrier. Do (at least) 5 things to improve this plot by changing, adding, or subtracting to this plot.
```{r, echo = F, results = F}
library(tidyverse)
#flight <- read.csv("data/nyc_13_flights.csv")
#flight <- flight[sample(1:nrow(flight), size = 50000),]
#write.csv(flight, "data/nyc_13_flights_small.csv", row.names = F)
flight <- read.csv("data/nyc_13_flights_small.csv")
planes <- read.csv("data/nyc_13_planes.csv")
weather <- read.csv("data/nyc_13_weather.csv")
intersect(colnames(flight), colnames(planes))
intersect(colnames(flight), colnames(weather))
df <- flight %>%
left_join(planes) %>%
left_join(weather)
#Plot the departure delay of flights against the precipitation, and include a simple regression line as part of the plot
colnames(df)
max(df$dep_delay, na.rm = T)
df %>%
filter(dep_delay > 0) %>%
ggplot(aes(x = precip, y = dep_delay)) +
geom_point() +
geom_smooth(method = "lm")
# Create a figure that has date on the x axis and mean departure delay on the y axis. Plot only months September through December. Somehow distinguish between airline carriers (the method is up to you).
library(lubridate)
df %>%
filter(dep_delay > 0) %>%
filter(month %in% c(9:12)) %>%
mutate(date = ymd(paste(year, month, day, sep = "-"))) %>%
mutate(mean_dep_delay = mean(dep_delay, na.rm = T)) %>%
unique() %>%
ggplot(aes(x = date, y = mean_dep_delay)) +
geom_point() +
facet_wrap(~carrier)
# Create a dataframe with the average temperature by month at each origin airport, where the data is wide (i.e. every airport has a column)
df %>%
group_by(month, origin) %>%
summarize(mean_temp = mean(temp, na.rm = T)) %>%
pivot_wider(names_from = origin, values_from = mean_temp)
#4. Make a function that can: (1) convert hours to minutes; and (2) convert minutes to hours (i.e., it's going to require some sort of conditional setting in the function that determines which direction the conversion is going); use this function to convert departure delay (currently in minutes) to hours.
min2hr <- function(hr = NULL, min = NULL, unit){
if (unit == "minute"){
hour = min/60
print(hour)
} else if (unit == "hour"){
minute = hr*60
print(minute)
}
}
min2hr(min = 760, unit = "minute")
min2hr(hr = 7, unit = "hour")
min2hr <- function(hr = NULL, min = NULL, unit){
ifelse(unit == "minute", min/60, hr*60)
}
min2hr <- function(x = NULL, unit){
ifelse(unit == "minute", x/60, x*60)
}
min2hr(760, unit = "minute")
min2hr(7, unit = "hour")
df$dep_delay_hrs <- NA
for(i in 1:nrow(df)){
df$dep_delay_hrs[i] <- min2hr(df$dep_delay[i], unit = "minute")
}
#OR
df$dep_delay_hrs <- map(df$dep_delay, ~ min2hr(.x, unit = "minute"))
```
```{r, echo = F}
ggplot(df, aes(x = dep_delay_hrs, y = carrier, fill = carrier)) +
geom_boxplot()
```