-
Notifications
You must be signed in to change notification settings - Fork 13
/
lesson_06_data_manipulation_pt1.Rmd
363 lines (280 loc) · 14.2 KB
/
lesson_06_data_manipulation_pt1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
---
title: "Manipulating and analyzing data in the tidyverse, Part 1"
---
```{r, echo=FALSE, purl=FALSE, message = FALSE}
#source("setup.R")
surveys <- read.csv("data/portal_data_joined.csv")
suppressWarnings(surveys$date <- lubridate::ymd(paste(surveys$year,
surveys$month,
surveys$day,
sep = "-")))
```
<div class = "blue">
### Learning Objectives
* Describe the purpose of the **`dplyr`** and **`tidyr`** packages.
* Select certain columns in a data frame with the **`dplyr`** function `select`.
* Select certain rows in a data frame according to filtering conditions with the **`dplyr`** function `filter`.
* Link the output of one **`dplyr`** function to the input of another function with the 'pipe' operator `%>%`.
* Add new columns to a data frame that are functions of existing columns with `mutate`.
* Use the split-apply-combine concept for data analysis.
* Use `summarize`, `group_by`, and `tally` to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
</div>
## Data Manipulation using **`dplyr`** and **`tidyr`**
Bracket subsetting is handy, but it can be cumbersome and difficult to read,
especially for complicated operations. Enter **`dplyr`**. **`dplyr`** is a package for
making tabular data manipulation easier. It pairs nicely with **`tidyr`** which enables you to swiftly convert between different data formats for plotting and analysis.
Packages in R are basically sets of additional functions that let you do more
stuff. The functions we've been using so far, like `str()` or `data.frame()`,
come built into R; packages give you access to more of them. Before you use a
package for the first time you need to install it on your machine, and then you
should import it in every subsequent R session when you need it. You should
already have installed the **`tidyverse`** package. This is an
"umbrella-package" that installs several packages useful for data analysis which
work together well such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc.
The **`tidyverse`** package tries to address 3 major problems with some of base R functions:
1. The results from a base R function sometimes depend on the type of data.
2. Using R expressions in a non standard way, which can be confusing for new learners.
3. Hidden arguments, having default operations that new learners are not aware of.
To load the package type:
```{r, message = FALSE, purl = FALSE, warning = FALSE}
library("tidyverse") ## load the tidyverse packages, incl. dplyr
```
### What are **`dplyr`** and **`tidyr`**?
The package **`dplyr`** provides easy tools for the most common data manipulation
tasks. It is built to work directly with data frames, with many common tasks
optimized by being written in a compiled language (C++). An additional feature is the
ability to work directly with data stored in an external database. The benefits of
doing this are that the data can be managed natively in a relational database,
queries can be conducted on that database, and only the results of the query are
returned.
This addresses a common problem with R in that all operations are conducted
in-memory and thus the amount of data you can work with is limited by available
memory. The database connections essentially remove that limitation in that you
can connect to a database of many hundreds of GB, conduct queries on it directly, and pull
back into R only what you need for analysis.
The package **`tidyr`** addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups - like plots or aquaria. Moving back and forth between these formats is nontrivial, and **`tidyr`** gives you tools for this and more sophisticated data manipulation.
To learn more about **`dplyr`** and **`tidyr`** after the workshop, you may want to check out this
[handy data transformation with **`dplyr`** cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) and this [one about **`tidyr`**](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf).
We'll read in our data using the `read_csv()` function, from the tidyverse package **`readr`**, instead of `read.csv()`, the base function for reading in data. The data we are going to be using today should already be in your R_DAVIS_2022 project in the folder `data`.
```{r,results = 'hide', purl = FALSE}
surveys <- read_csv("data/portal_data_joined.csv")
## inspect the data
str(surveys)
```
Notice that the class of the data is now `tbl_df`
This is referred to as a "tibble".
Tibbles are almost identical to R's standard data frames, but they tweak some of the old behaviors of data frames. For our purposes the only differences between data frames
and tibbles are that:
1. When you print a tibble, R displays the data type of each column under its name; it prints only the first few rows of data and only as many columns as fit
on one screen.
2. Columns of class `character` are never automatically converted into factors.
## Selecting columns and filtering rows
We're going to learn some of the most common **`dplyr`** functions: `select()`,
`filter()`, `mutate()`, `group_by()`, `summarize()`, and `join`. To select columns of a
data frame, use `select()`. The first argument to this function is the data
frame (`surveys`), and the subsequent arguments are the **columns** to keep.
```{r, results = 'hide', purl = FALSE}
select(surveys, plot_id, species_id, weight)
```
To choose **rows** based on a specific criteria, use `filter()`:
```{r, purl = FALSE}
filter(surveys, year == 1995)
```
`select` is used for **rows** and `filter` is used for **columns**.
## Pipes
What if you want to select and filter at the same time? There are three
ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use
that as input to the next function, like this:
```{r, purl = FALSE}
surveys2 <- filter(surveys, weight < 5)
surveys_sml <- select(surveys2, species_id, sex, weight)
```
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
```{r, purl = FALSE}
surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)
```
This is handy, but can be difficult to read if too many functions are nested, as
R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option is **`pipes`**. Pipes let you take
the output of one function and send it directly to the next, which is useful
when you need to do many things to the same dataset. Pipes in R look like
`%>%` and are made available via the **`magrittr`** package, installed automatically
with **`dplyr`**. If you use RStudio, you can type the pipe with <kbd>Ctrl</kbd>
+ <kbd>Shift</kbd> + <kbd>M</kbd> if you have a PC or <kbd>Cmd</kbd> +
<kbd>Shift</kbd> + <kbd>M</kbd> if you have a Mac.
```{r, purl = FALSE}
surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
```
In the above code, we use the pipe to send the `surveys` dataset first through
`filter()` to keep rows where `weight` is less than 5, then through `select()`
to keep only the `species_id`, `sex`, and `weight` columns. Since `%>%` takes
the object on its left and passes it as the first argument to the function on
its right, we don't need to explicitly include the data frame as an argument
to the `filter()` and `select()` functions any more.
Some may find it helpful to read the pipe like the word "then". For instance,
in the above example, we took the data frame `surveys`, *then* we `filter`ed
for rows with `weight < 5`, *then* we `select`ed columns `species_id`, `sex`,
and `weight`. The **`dplyr`** functions by themselves are somewhat simple,
but by combining them into linear workflows with the pipe, we can accomplish
more complex manipulations of data frames.
If we want to create a new object with this smaller version of the data, we
can assign it a new name:
```{r, purl = FALSE}
surveys_sml <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
surveys_sml
```
Note that the final data frame is the leftmost part of this expression.
<div class = "blue">
### Challenge
Using pipes, subset the `surveys` data to include individuals collected before
1995 and retain only the columns `year`, `sex`, and `weight`. Name this dataframe `surveys_challenge`
<details>
<summary>ANSWER</summary>
```{r, eval=FALSE, purl=FALSE}
surveys_challenge <- surveys %>%
filter(year < 1995) %>%
select(year, sex, weight)
```
</details>
</div>
<br>
## Mutate
Frequently you'll want to create new columns based on the values in existing
columns, for example to do unit conversions, or to find the ratio of values in two
columns. For this we'll use `mutate()`.
To create a new column of weight in kg:
```{r, purl = FALSE}
surveys %>%
mutate(weight_kg = weight / 1000)
```
You can also create a second new column based on the first new column within the same call of `mutate()`:
```{r, purl = FALSE}
surveys %>%
mutate(weight_kg = weight / 1000,
weight_kg2 = weight_kg * 2)
```
If this runs off your screen and you just want to see the first few rows, you
can use a pipe to view the `head()` of the data. (Pipes work with non-**`dplyr`**
functions, too, as long as the **`dplyr`** or `magrittr` package is loaded).
```{r, purl = FALSE}
surveys %>%
mutate(weight_kg = weight / 1000) %>%
head()
```
The first few rows of the output are full of `NA`s, so if we wanted to remove
those we could insert a `filter()` in the chain:
```{r, purl = FALSE}
surveys %>%
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000) %>%
head()
```
`is.na()` is a function that determines whether something is an `NA`. The `!`
symbol negates the result, so we're asking for every row where weight *is not* an `NA`.
<div class = "blue">
### Challenge
Create a new data frame from the `surveys` data that meets the following
criteria: contains only the `species_id` column and a new column called
`hindfoot_half` containing values that are half the `hindfoot_length` values.
In this `hindfoot_half` column, there are no `NA`s and all values are less
than 30. Name this data frame `surveys_hindfoot_half`.
**Hint**: think about how the commands should be ordered to produce this data frame!
<details>
<summary>ANSWER</summary>
```{r, eval=FALSE, purl=FALSE}
surveys_hindfoot_half <- surveys %>%
filter(!is.na(hindfoot_length)) %>%
mutate(hindfoot_half = hindfoot_length / 2) %>%
filter(hindfoot_half < 30) %>%
select(species_id, hindfoot_half)
```
</details>
</div>
<br>
## Group by and summarize
Many data analysis tasks can be approached using the *split-apply-combine*
paradigm: split the data into groups, apply some analysis to each group, and
then combine the results. **`dplyr`** makes this very easy through the use of the
`group_by()` function.
`group_by()` is often used together with `summarize()`, which collapses each
group into a single-row summary of that group. `group_by()` takes as arguments
the column names that contain the **categorical** variables for which you want
to calculate the summary statistics. So to compute the mean `weight` by sex:
```{r, purl = FALSE}
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
```
You may also have noticed that the output from these calls doesn't run off the
screen anymore. It's one of the advantages of `tbl_df` over data frame.
You can also group by multiple columns:
```{r, purl = FALSE}
surveys %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
```
When grouping both by `sex` and `species_id`, the first rows are for individuals
that escaped before their sex could be determined and weighted. You may notice
that the last column does not contain `NA` but `NaN` (which refers to "Not a
Number"). To avoid this, we can remove the missing values for weight before we
attempt to calculate the summary statistics on weight. Because the missing
values are removed first, we can omit `na.rm = TRUE` when computing the mean:
```{r, purl = FALSE}
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight))
```
Here, again, the output from these calls doesn't run off the screen
anymore. If you want to display more data, you can use the `print()` function
at the end of your chain with the argument `n` specifying the number of rows to
display:
```{r, purl = FALSE}
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight)) %>%
print(n = 15)
```
Once the data are grouped, you can also summarize multiple variables at the same
time (and not necessarily on the same variable). For instance, we could add a
column indicating the minimum weight for each species for each sex:
```{r, purl = FALSE}
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight))
```
<div class = "blue">
### Challenge
1. What was the weight of the heaviest animal measured in each year? Return a table with three columns: year, weight of the heaviest animal in grams, and weight in kilograms, arranged (`arrange()`) in descending order, from heaviest to lightest. (This table should have 26 rows, one for each year)
2. Try out a new function, `count()`. Group the data by `sex` and pipe the grouped data into the `count()` function. How could you get the same result
using `group_by()` and `summarize()`? Hint: see `?n`.
<details>
<summary>ANSWER</summary>
```{r, purl=FALSE}
## Answer 1
surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>%
summarize(max_weight_g = max(weight)) %>%
mutate(max_weight_kg = max_weight_g/1000) %>%
arrange()
## Answer 2
surveys %>%
group_by(sex) %>%
count()
surveys %>%
group_by(sex) %>%
summarize(n = n())
```
</details>
</div>
<br>