---
title: "Coding Practice"
output:
  html_document:
    toc: false
---
Assignments are designed to reinforce the code/lessons covered that week and give you a chance to practice working with GitHub. Assignments are to be completed in your local project (on your computer) and pushed up to your GitHub repository for instructors to review by the following Wednesday. For example, homework for Week 1 (which is on September 22) should be submitted by September 28. That said, these due dates are largely suggestions meant to help you prioritize and stay caught up as a group -- if you need, or want, more time, take it. At the end of the quarter, we will simply look over the tasks you have completed in concert with the reflection you submit.
<br>
## Assignments {.tabset .tabset-fade .tabset-pills}
### <small>Week 1</small>
<br>
Because this class caters to a range of experiences, we would like you to identify how it is you plan on earning credit for this course. Before next week, please complete [this form](https://forms.gle/HHTZJVVtScUrEMLm7) to select and describe your plans.
### <small>Week 2</small>
In Week 2's homework we are going to practice subsetting and manipulating vectors.
First, open your r-davis-in-class-project-YourName and `pull`. Remember, we always want to start working on a GitHub project by pulling, even if we are sure nothing has changed (believe me, this small step will save you lots of headaches).
Second, open a new script in your r-davis-in-class-project-YourName and save it to your `scripts` folder. Call this new script `week_2_homework`.
Copy and paste the chunk of code below into your new `week_2_homework` script and run it. This chunk of code will create the vector you will use in your homework today. Check in your environment to see what it looks like. What do you think each line of code is doing?
```{r}
set.seed(15)
hw2 <- runif(50, 4, 50)
hw2 <- replace(hw2, c(4,12,22,27), NA)
hw2
```
1. Take your `hw2` vector and remove all the NAs, then select all the numbers between 14 and 38 inclusive. Call this vector `prob1`.
2. Multiply each number in the `prob1` vector by 3 to create a new vector called `times3`. Then add 10 to each number in your `times3` vector to create a new vector called `plus10`.
3. Select every other number in your `plus10` vector by selecting the first number, not the second, the third, not the fourth, etc. If you've worked through these three problems in order, you should now have a vector that is 12 numbers long that looks **exactly** like this one:
```{r, echo = F}
prob1 <- hw2[!is.na(hw2)] #removing the NAs
prob1 <- prob1[prob1 >= 14 & prob1 <= 38] #only selecting numbers between 14 and 38 (inclusive)
times3 <- prob1 * 3 #multiplying by 3
plus10 <- times3 + 10 #adding 10 to the whole vector
final <- plus10[c(TRUE, FALSE)] #selecting every other number
```
```{r}
final
```
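The subsetting tools these problems rely on can be sketched on a small made-up vector. The vector `x` and its values are hypothetical, chosen only for illustration, and are unrelated to `hw2`:

```{r}
# A small made-up vector with one missing value (not the homework data)
x <- c(5, NA, 20, 35, 40, 16)

# is.na() flags missing values; ! flips the flags, so this keeps the non-missing ones
x_complete <- x[!is.na(x)]

# Two conditions joined with & keep only values inside a range (inclusive here)
x_mid <- x_complete[x_complete >= 10 & x_complete <= 36]

# A short logical vector is recycled along the whole vector,
# so c(TRUE, FALSE) keeps positions 1, 3, 5, ...
x_odd <- x_complete[c(TRUE, FALSE)]
```

Each step stores its result under a new name, which makes it easy to check the intermediate vectors in your environment.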
Finally, save your script and push all your changes to your GitHub repository.
<details>
<summary>**DO NOT OPEN** until you are ready to see the answers</summary>
```{r, eval=T}
prob1 <- hw2[!is.na(hw2)] #removing the NAs
prob1 <- prob1[prob1 >= 14 & prob1 <= 38] #only selecting numbers between 14 and 38 (inclusive)
times3 <- prob1 * 3 #multiplying by 3
plus10 <- times3 + 10 #adding 10 to the whole vector
final <- plus10[c(TRUE, FALSE)] #selecting every other number using logical subsetting
```
</details>
### <small>Week 3</small>
Homework this week will be playing with the `surveys` data we worked on in class. First things first, open your r-davis-in-class-project and pull. Then create a new script in your `scripts` folder called `week_3_homework.R`.
Load your `surveys` data frame with the `read.csv()` function. Create a new data frame called `surveys_base` with only the species_id, weight, and plot_type columns, keeping only the first 5,000 rows. Convert both species_id and plot_type to factors. Remove all rows where there is an NA in the weight column. Explore these variables and try to explain why a factor is different from a character. Why might we want to use factors? Can you think of any examples?
CHALLENGE:
Create a second data frame called `challenge_base` that only consists of individuals from your `surveys_base` data frame with weights greater than 150g.
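As a starting point for exploring how factors differ from characters, here is a minimal sketch on a made-up vector. The values in `plot_chr` are hypothetical and only loosely resemble the real plot types:

```{r}
# Made-up plot types, stored first as plain character strings
plot_chr <- c("Control", "Spectab exclosure", "Control", "Rodent Exclosure")
plot_fct <- factor(plot_chr)

typeof(plot_chr)      # "character"
typeof(plot_fct)      # "integer": a factor stores integer codes...
levels(plot_fct)      # ...plus a lookup table of the unique labels
as.integer(plot_fct)  # the underlying code for each element
table(plot_fct)       # counts per level
```

Try the same functions on the columns of `surveys_base` and compare what you see.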
<details>
<summary>**DO NOT OPEN** until you are ready to see the answers for the homework</summary>
```{r, eval=T}
#PROBLEM 1
surveys <- read.csv("data/portal_data_joined.csv") #reading the data in
colnames(surveys) #a list of the column names
surveys_base <- surveys[1:5000, c(6, 9, 13)] #selecting rows 1:5000 and just columns 6, 9 and 13
surveys_base <- surveys_base[complete.cases(surveys_base), ] #selecting only the ROWS that have complete cases (no NAs) **Notice the comma was needed for this to work**
surveys_base$species_id <- factor(surveys_base$species_id) #converting character data to factor
surveys_base$plot_type <- factor(surveys_base$plot_type) #converting character data to factor
#Experimentation of factors
levels(surveys_base$species_id)
typeof(surveys_base$species_id)
class(surveys_base$species_id)
#CHALLENGE
challenge_base <- surveys_base[surveys_base[, 2]>150,] #selecting just the weights (column 2) that are greater than 150
```
</details>
<br>
### <small>Week 4</small>
By now you should be in the rhythm of pulling from your git repository and then creating a new homework script. This week the homework will review data manipulation in the tidyverse.
1. Create a tibble named `surveys` from the portal_data_joined.csv file.
2. Subset `surveys` using Tidyverse methods to keep rows with weight between 30 and 60, and print out the first 6 rows.
3. Create a new tibble showing the maximum weight for each species + sex combination and name it `biggest_critters`. Sort the tibble to take a look at the biggest and smallest species + sex combinations. HINT: it's easier to calculate max if there are no NAs in the dataframe...
4. Try to figure out where the NA weights are concentrated in the data- is there a particular species, taxa, plot, or whatever, where there are lots of NA values? There isn’t necessarily a right or wrong answer here, but manipulate surveys a few different ways to explore this. Maybe use `tally` and `arrange` here.
5. Take `surveys`, remove the rows where weight is NA and add a column that contains the average weight of each species+sex combination to the **full** `surveys` dataframe. Then get rid of all the columns except for species, sex, weight, and your new average weight column. Save this tibble as `surveys_avg_weight`.
6. Take `surveys_avg_weight` and add a new column called `above_average` that contains logical values stating whether or not a row’s weight is above average for its species+sex combination (recall the new column we made for this tibble).
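Questions 3 and 5 hinge on the difference between collapsing groups and keeping every row. A minimal sketch on a made-up tibble (`toy` and its columns are hypothetical, not the surveys data) might look like this:

```{r}
library(tidyverse)

# A tiny made-up table (not the surveys data)
toy <- tibble(
  species = c("A", "A", "B", "B", "B"),
  weight  = c(10, 20, 30, 40, 50)
)

# summarise() collapses each group down to a single row
toy_summary <- toy %>%
  group_by(species) %>%
  summarise(avg_weight = mean(weight))

# mutate() keeps every row and repeats the group value alongside it
toy_mutate <- toy %>%
  group_by(species) %>%
  mutate(avg_weight = mean(weight)) %>%
  ungroup()
```

Here `toy_summary` has one row per species, while `toy_mutate` still has all five rows with the group average attached to each.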
<details>
<summary>**DO NOT OPEN** until you are ready to see the answers for the homework</summary>
```{r, eval = T}
library(tidyverse)
#1
surveys <- read_csv("data/portal_data_joined.csv")
#2
surveys %>%
  filter(weight > 30 & weight < 60) %>%
  head() #printing only the first 6 rows
#3
biggest_critters <- surveys %>%
  filter(!is.na(weight)) %>%
  group_by(species_id, sex) %>%
  summarise(max_weight = max(weight))
biggest_critters %>% arrange(max_weight)
biggest_critters %>% arrange(desc(max_weight))
#4
surveys %>%
  filter(is.na(weight)) %>%
  group_by(species) %>%
  tally() %>%
  arrange(desc(n))
surveys %>%
  filter(is.na(weight)) %>%
  group_by(plot_id) %>%
  tally() %>%
  arrange(desc(n))
surveys %>%
  filter(is.na(weight)) %>%
  group_by(year) %>%
  tally() %>%
  arrange(desc(n))
#5
surveys_avg_weight <- surveys %>%
  filter(!is.na(weight)) %>%
  group_by(species_id, sex) %>%
  mutate(avg_weight = mean(weight)) %>%
  select(species_id, sex, weight, avg_weight)
surveys_avg_weight
#6
surveys_avg_weight <- surveys_avg_weight %>%
  mutate(above_average = weight > avg_weight)
surveys_avg_weight
```
</details>
### <small>Week 5</small>
This week's questions will have us practicing pivots and conditional statements.
1. Create a tibble named `surveys` from the portal_data_joined.csv file. Then manipulate `surveys` to create a new dataframe called `surveys_wide` with a column for genus and a column named after every plot type, with each of these columns containing the mean hindfoot length of animals in that plot type and genus. So every row has a genus and then a mean hindfoot length value for every plot type. The dataframe should be sorted by values in the Control plot type column. This question will involve quite a few of the functions you've used so far, and it may be useful to sketch out the steps to get to the final result.
2. Using the original `surveys` dataframe, use the two different functions we laid out for conditional statements, ifelse() and case_when(), to calculate a new weight category variable called `weight_cat`. For this variable, define the rodent weight into three categories, where "small" is less than or equal to the 1st quartile of weight distribution, "medium" is between (but not inclusive) the 1st and 3rd quartile, and "large" is any weight greater than or equal to the 3rd quartile. (Hint: the summary() function on a column summarizes the distribution). For ifelse() and case_when(), compare what happens to the weight values of NA, depending on how you specify your arguments.
BONUS: How might you soft code the values (i.e. not type them in manually) of the 1st and 3rd quartile into your conditional statements in question 2?
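To see how the two conditional functions treat missing values before applying them to `surveys`, here is a small sketch on a made-up vector. The vector `w`, the cutoff of 5, and the labels are all hypothetical. It also uses `quantile()`, which is one possible way (not necessarily the intended one) to compute cut points instead of typing them in:

```{r}
library(dplyr)

# Made-up values with one NA (not the surveys weights)
w <- c(2, 8, NA, 15)

# A catch-all TRUE condition also captures NA, because none of the
# earlier comparisons evaluate to TRUE for a missing value
cw_catch_all <- case_when(w < 5 ~ "low", TRUE ~ "high")

# With explicit conditions only, unmatched values (including NA) stay NA
cw_explicit <- case_when(w < 5 ~ "low", w >= 5 ~ "high")

# ifelse() returns NA wherever the test itself is NA
ie <- ifelse(w < 5, "low", "high")

# quantile() returns cut points computed from the data
q <- quantile(w, probs = c(0.25, 0.75), na.rm = TRUE)
```

Comparing `cw_catch_all`, `cw_explicit`, and `ie` side by side shows where the NA ends up in each case.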
<details>
<summary>**DO NOT OPEN** until you are ready to see the answers for the homework</summary>
```{r, eval = T}
# 1
library(tidyverse)
surveys <- read_csv("data/portal_data_joined.csv")
surveys_wide <- surveys %>%
  filter(!is.na(hindfoot_length)) %>%
  group_by(genus, plot_type) %>%
  summarise(mean_hindfoot = mean(hindfoot_length)) %>%
  pivot_wider(names_from = plot_type, values_from = mean_hindfoot) %>%
  arrange(Control)
# 2
summary(surveys$weight)
# The final "else" argument here, where I used T ~ "large", applies even to NAs, which is not something we want
surveys %>%
  mutate(weight_cat = case_when(
    weight <= 20.00 ~ "small",
    weight > 20.00 & weight < 48.00 ~ "medium",
    T ~ "large"
  ))
# To overcome this, case_when() allows us to skip the "else" argument and spell out the final condition instead, which reduces confusion. This leaves NAs as is
surveys %>%
  mutate(weight_cat = case_when(
    weight <= 20.00 ~ "small",
    weight > 20.00 & weight < 48.00 ~ "medium",
    weight >= 48.00 ~ "large"
  ))
# The "else" argument in ifelse() does not include NAs when specified, which is useful. The shortcoming, however, is that ifelse() does not allow you to leave out a final else argument, which means it is really important to always check what that last argument is being assigned to.
surveys %>%
  mutate(weight_cat = ifelse(weight <= 20.00, "small",
                             ifelse(weight > 20.00 & weight < 48.00, "medium", "large")))
# BONUS:
summ <- summary(surveys$weight)
# Remember our indexing skills from the first weeks? Play around with single and double bracketing to see how it can extract values
summ[[2]] # 1st quartile
summ[[5]] # 3rd quartile
# Then you can nest these into your code
surveys %>%
  mutate(weight_cat = case_when(
    weight <= summ[[2]] ~ "small",
    weight > summ[[2]] & weight < summ[[5]] ~ "medium",
    weight >= summ[[5]] ~ "large"
  ))
```
</details>
### <small>Week 6</small>
For our week six homework, we are going to be practicing the skills we learned with ggplot during class. You will be happy to know that we are going to be using a brand new data set called `gapminder`. This data set looks at statistics for a number of different countries, including population, GDP per capita, and life expectancy. Load the data using the code below, which reads the .csv straight from the course website. If you would rather work from a local copy, save the file in your `data` folder and change the path in `read_csv()` to match.
```{r, warning = F}
library(tidyverse)
gapminder <- read_csv("https://gge-ucd.github.io/R-DAVIS/data/gapminder.csv") #reads the .csv directly from the course website
```
1. First calculate the mean life expectancy on each continent. Then create a plot that shows how life expectancy has changed over time in each continent. Try to do this all in one step using pipes! (aka, try not to create intermediate dataframes)
2. Look at the following code and answer the following questions. What do you think the `scale_x_log10()` line of code is achieving? What about the `geom_smooth()` line of code?
*Challenge!* Modify the code below to size the points in proportion to the population of the country.
**Hint:** Are you translating data to a visual feature of the plot?
**Hint:** There's no cost to tinkering! Try some code out and see what happens with or without particular elements.
```{r}
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), size = .25) +
  scale_x_log10() +
  geom_smooth(method = 'lm', color = 'black', linetype = 'dashed') +
  theme_bw()
```
3. Create a boxplot that shows the life expectancy for Brazil, China, El Salvador, Niger, and the United States, with the data points in the background using geom_jitter. Label the X and Y axes with "Country" and "Life Expectancy" and title the plot "Life Expectancy of Five Countries".
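The first hint in question 2 is about the difference between setting a visual feature to a fixed value and mapping it to data. A minimal sketch of that distinction, using the built-in `mtcars` data as a stand-in (an assumption for illustration; the columns `wt`, `mpg`, and `hp` come from `mtcars`, not `gapminder`) and a different aesthetic than the one the challenge asks about:

```{r}
library(ggplot2)

# Setting: alpha sits outside aes(), so every point gets the same fixed transparency
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.5)

# Mapping: alpha sits inside aes(), so it varies with a data column (hp)
# and ggplot adds a legend for it
p_map <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(alpha = hp))
```

Print `p_set` and `p_map` to compare them; only the mapped version changes from point to point.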
<details>
<summary>**DO NOT OPEN** until you are ready to see the answers! </summary>
```{r, eval = T}
library(tidyverse)
#PROBLEM 1:
gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp)) %>% #calculating the mean life expectancy for each continent and year
  ggplot() +
  geom_point(aes(x = year, y = mean_lifeExp, color = continent)) + #scatter plot
  geom_line(aes(x = year, y = mean_lifeExp, color = continent)) #line plot
#there are other ways to represent this data and answer this question. Try a facet wrap! Play around with themes and ggplotly!
#PROBLEM 2:
#challenge answer
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10() +
  geom_smooth(method = 'lm', color = 'black', linetype = 'dashed') +
  theme_bw()
#PROBLEM 3:
countries <- c("Brazil", "China", "El Salvador", "Niger", "United States") #create a vector with just the countries we are interested in
gapminder %>%
  filter(country %in% countries) %>%
  ggplot(aes(x = country, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.3, color = "blue") +
  theme_minimal() +
  ggtitle("Life Expectancy of Five Countries") + #title the figure
  theme(plot.title = element_text(hjust = 0.5)) + #centered the plot title
  xlab("Country") + ylab("Life Expectancy") #how to change axis names
```
</details>
### <small>Week 7</small>
For week 7, we're going to be working on 2 critical `ggplot` skills: recreating a graph from a dataset and **googling stuff**.
Our goal will be to make this final graph using the `gapminder` dataset:
```{r, eval=T, echo=F, warning=F, message=F}
library(tidyverse)
gapminder <- read_csv("data/gapminder.csv")
pg <- gapminder %>%
  select(country, year, pop, continent) %>%
  filter(year > 2000) %>%
  pivot_wider(names_from = year, values_from = pop) %>%
  mutate(pop_change_0207 = `2007` - `2002`)
```
```{r, eval=T, echo=F, warning=F, message=F, fig.width = 8, fig.height = 5}
pg %>%
  filter(continent != "Oceania") %>%
  ggplot(aes(x = reorder(country, pop_change_0207), y = pop_change_0207)) +
  geom_col(aes(fill = continent)) +
  facet_wrap(~continent, scales = "free") +
  theme_bw() +
  scale_fill_viridis_d() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  xlab("Country") +
  ylab("Change in Population Between 2002 and 2007")
```
The x axis labels are all scrunched up because we can't make the image bigger on the webpage, but if you make it and then zoom it bigger in RStudio it looks much better.
We'll touch on some intermediate steps here, since it might take quite a few steps to get from start to finish. Here are some things to note:
1. To get the population difference between 2002 and 2007 for each country, it would probably be easiest to have a country in each row and a column for 2002 population and a column for 2007 population.
2. Notice the order of countries within each facet. You'll have to look up how to order them in this way.
3. Also look at how the axes are different for each facet. Try looking through `?facet_wrap` to see if you can figure this one out.
4. The color scale is different from the default- feel free to try out other color scales, just don't use the defaults!
5. The theme here is different from the default in a few ways, again, feel free to play around with other non-default themes.
6. The axis labels are rotated! Here's a hint: `angle = 45, hjust = 1`. It's up to you (and Google) to figure out where this code goes!
7. Is there a legend on this plot?
This lesson should illustrate a key reality of making plots in R, one that applies as much to experts as beginners: 10% of your effort gets the plot 90% right, and 90% of the effort is getting the plot perfect. `ggplot` is incredibly powerful for exploratory analysis, as you can get a good plot with only a few lines of code. It's also extremely flexible, allowing you to tweak nearly everything about a plot to get a highly polished final product, but these little tweaks can take a lot of time to figure out!
So if you spend most of your time on this lesson googling stuff, you're not alone!
<details>
<summary>**DO NOT OPEN** until you are ready to see the answers</summary>
```{r, eval=FALSE}
library(tidyverse)
gapminder <- read_csv("data/gapminder.csv")
pg <- gapminder %>%
  select(country, year, pop, continent) %>%
  filter(year > 2000) %>%
  pivot_wider(names_from = year, values_from = pop) %>%
  mutate(pop_change_0207 = `2007` - `2002`)
pg %>%
  filter(continent != "Oceania") %>%
  ggplot(aes(x = reorder(country, pop_change_0207), y = pop_change_0207)) +
  geom_col(aes(fill = continent)) +
  facet_wrap(~continent, scales = "free") +
  theme_bw() +
  scale_fill_viridis_d() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  xlab("Country") +
  ylab("Change in Population Between 2002 and 2007")
```
</details>
### <small>Week 8</small>
Let's look at some real data from Mauna Loa to try to format and plot. These meteorological data from Mauna Loa were collected every minute for the year 2001. *This dataset has 459,769 observations for 9 different metrics of wind, humidity, barometric pressure, air temperature, and precipitation.* Download this dataset [here](data/mauna_loa_met_2001_minute.rda). Save it to your `data/` folder. Alternatively, you can read the CSV directly from the R-DAVIS Github:
`mloa <- read_csv("https://raw.githubusercontent.com/gge-ucd/R-DAVIS/master/data/mauna_loa_met_2001_minute.csv")`
Use the [README](data/mauna_loa_README.txt) file associated with the Mauna Loa dataset to determine in what time zone the data are reported, and how missing values are reported in each column. With the `mloa` data.frame, remove observations with missing values in rel_humid, temp_C_2m, and windSpeed_m_s. Generate a column called "datetime" using the year, month, day, hour24, and min columns. Next, create a column called "datetimeLocal" that converts the datetime column to Pacific/Honolulu time (*HINT*: look at the lubridate function called `with_tz()`). Then, use dplyr to calculate the mean hourly temperature each month using the temp_C_2m column and the datetimeLocal columns. (*HINT*: Look at the lubridate functions called `month()` and `hour()`). Finally, make a ggplot scatterplot of the mean monthly temperature, with points colored by local hour.
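Before working on the full dataset, it can help to try the date-time steps on a single made-up timestamp. In this sketch the date, the time, and the choice of UTC as the starting zone are assumptions for illustration only; check the README for what the real data use:

```{r}
library(lubridate)

# A made-up timestamp assembled from separate pieces, as text
stamp_text <- paste0(2001, "-", 1, "-", 15, " ", 12, ":", 30)

# Parse it, declaring (for this toy example) that the clock time is in UTC
stamp_utc <- ymd_hm(stamp_text, tz = "UTC")

# with_tz() keeps the same instant but displays it in another time zone
stamp_hnl <- with_tz(stamp_utc, tzone = "Pacific/Honolulu")

hour(stamp_utc)  # 12
hour(stamp_hnl)  # 2, because Honolulu is 10 hours behind UTC
```

Both objects describe the same moment; only the clock time shown, and therefore what `hour()` and `month()` return, differs.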
Answers:
<details>
<summary>**DO NOT OPEN** until you are ready to see the answers</summary>
```{r, eval=FALSE}
library(tidyverse)
library(lubridate)
## Data import
mloa <- read_csv("https://raw.githubusercontent.com/gge-ucd/R-DAVIS/master/data/mauna_loa_met_2001_minute.csv")
mloa2 <- mloa %>%
  # Remove NA's
  filter(rel_humid != -99) %>%
  filter(temp_C_2m != -999.9) %>%
  filter(windSpeed_m_s != -999.9) %>%
  # Create datetime column (README indicates time is in UTC)
  mutate(datetime = ymd_hm(paste0(year, "-",
                                  month, "-",
                                  day, " ",
                                  hour24, ":",
                                  min),
                           tz = "UTC")) %>%
  # Convert to local time
  mutate(datetimeLocal = with_tz(datetime, tz = "Pacific/Honolulu"))
## Aggregate and plot
mloa2 %>%
  # Extract month and hour from local time column
  mutate(localMon = month(datetimeLocal, label = TRUE),
         localHour = hour(datetimeLocal)) %>%
  # Group by local month and hour
  group_by(localMon, localHour) %>%
  # Calculate mean temperature
  summarize(meantemp = mean(temp_C_2m)) %>%
  # Plot
  ggplot(aes(x = localMon,
             y = meantemp)) +
  # Color points by local hour
  geom_point(aes(col = localHour)) +
  # Use a nice color ramp
  scale_color_viridis_c() +
  # Label axes, add a theme
  xlab("Month") +
  ylab("Mean temperature (degrees C)") +
  theme_classic()
```
</details>
### <small>Final assignment</small>
Forthcoming
<!-----
Alright folks, it's time for the final assignment of the quarter. The goal here is to generate a script that combines the skills we've learned throughout the quarter to produce several outputs.
<br>
#### The Data
For this project you are going to be using some data sets about flights departing New York City in 2013. There are **several** CSV files you will need to use (as with any CSVs you're handed, they are likely imperfect and incomplete).
You should download the [flights](data/nyc_13_flights_small.csv), [planes](data/nyc_13_planes.csv), and [weather](data/nyc_13_weather.csv) CSV files. (Remember to put them into your data folder of your RProject to make reading them in easier!)
Hint: You may have to combine dataframes to answer some questions. Remember our `join` family of functions? You should be able to use the `join` type we covered in class. The `flights` dataset is the biggest one, so you should probably join the other data onto this one, meaning `flights` would be the first (or "left") argument in the left join. You can't join 3 tables together at once, but you can join tables `a` and `b` to make table `ab`, then join `ab` and `c` to get table `abc`, which contains the columns from all 3 original tables.
#### Things to Include
1. Plot the departure delay of flights against the precipitation, and include a simple regression line as part of the plot. Hint: there is a `geom_` that will plot a simple `y ~ x` regression line for you, but you might have to use an argument to make sure it's a regular **l**inear **m**odel. Use `ggsave` to save your ggplot objects into a **new folder** you create called "plots".
2. Create a figure that has date on the x axis and each day's mean departure delay on the y axis. Plot only months September through December. Somehow distinguish between airline carriers (the method is up to you). Again, save your final product into the "plots" folder.
3. Create a dataframe with these columns: date (year, month and day), mean_temp, where each row represents the airport, based on airport code. Save this as a new csv in your `data` folder called `mean_temp_by_origin.csv`.
4. Make a function that can: (1) convert hours to minutes; and (2) convert minutes to hours (i.e., it's going to require some sort of conditional setting in the function that determines which direction the conversion is going). Use this function to convert departure delay (currently in minutes) to hours and then generate a boxplot of departure delay times by carrier. Save this function into a script called "customFunctions.R" in your scripts/code folder.
5. Below is the plot we generated from the new data in Q4. (Base code:
`ggplot(df, aes(x = dep_delay_hrs, y = carrier, fill = carrier)) +
geom_boxplot()`). The goal is to visualize delays by carrier. Do (at least) 5 things to improve this plot by changing, adding, or subtracting to this plot. The sky's the limit here, remember we often reduce data to more succinctly communicate things.
```{r answer key, echo = F, results = 'hide', message = F, warning = F, fig.show='hide'}
library(tidyverse)
#flight <- read.csv("data/nyc_13_flights.csv")
#flight <- flight[sample(1:nrow(flight), size = 50000),]
#write.csv(flight, "data/nyc_13_flights_small.csv", row.names = F)
flight <- read.csv("data/nyc_13_flights_small.csv")
planes <- read.csv("data/nyc_13_planes.csv")
weather <- read.csv("data/nyc_13_weather.csv")
intersect(colnames(flight), colnames(planes))
intersect(colnames(flight), colnames(weather))
df <- flight %>%
left_join(planes) %>%
left_join(weather)
#Plot the departure delay of flights against the precipitation, and include a simple regression line as part of the plot
colnames(df)
max(df$dep_delay, na.rm = T)
df %>%
filter(dep_delay > 0) %>%
ggplot(aes(x = precip, y = dep_delay)) +
geom_point() +
geom_smooth(method = "lm")
# Create a figure that has date on the x axis and mean departure delay on the y axis. Plot only months September through December. Somehow distinguish between airline carriers (the method is up to you).
library(lubridate)
df %>%
  filter(dep_delay > 0) %>%
  filter(month %in% c(9:12)) %>%
  mutate(date = ymd(paste(year, month, day, sep = "-"))) %>%
  group_by(date, carrier) %>% #one group per day and carrier
  summarize(mean_dep_delay = mean(dep_delay, na.rm = T)) %>%
  ggplot(aes(x = date, y = mean_dep_delay)) +
  geom_point() +
  facet_wrap(~carrier)
# Create a dataframe with the average temperature by month at each origin airport, where the data is wide (i.e. every airport has a column)
df %>%
group_by(month, origin) %>%
summarize(mean_temp = mean(temp, na.rm = T)) %>%
pivot_wider(names_from = origin, values_from = mean_temp)
#4. Make a function that can: (1) convert hours to minutes; and (2) convert minutes to hours (i.e., it's going to require some sort of conditional setting in the function that determines which direction the conversion is going); use this function to convert departure delay (currently in minutes) to hours.
min2hr <- function(hr = NULL, min = NULL, from_unit = NULL){
if (from_unit == "minute"){
hour = min/60
print(hour)
} else if (from_unit == "hour"){
minute = hr*60
print(minute)
}
}
min2hr(min = 760, from_unit = "minute")
min2hr(hr = 7, from_unit = "hour")
# or
min2hr <- function(hr = NULL, min = NULL, from_unit = NULL){
ifelse(from_unit == "minute", min/60, hr*60)
}
# or
min2hr <- function(x = NULL, from_unit = NULL){
ifelse(from_unit == "minute", x/60, x*60)
}
min2hr(760, from_unit = "minute")
min2hr(7, from_unit = "hour")
df$dep_delay_hrs <- NA
for(i in 1:nrow(df)){
df$dep_delay_hrs[i] <- min2hr(df$dep_delay[i], from_unit = "minute")
}
# or
df$dep_delay_hrs <- map_dbl(df$dep_delay, ~ min2hr(.x, from_unit = "minute")) #map_dbl() returns a numeric vector rather than a list
## OR JUST TO SHOW A COOL FUNCTION
vec_min2hr <- Vectorize(min2hr,'x')
df <- df %>% mutate(dep_delay_hrs = vec_min2hr(dep_delay,from_unit = 'minute'))
```
```{r, echo = F}
df %>%
ggplot(aes(x = dep_delay_hrs, y = carrier, fill = carrier)) +
geom_boxplot()
```
--->