---
title: Data visualization with ggplot2
---
```{r setup, echo=FALSE, purl=FALSE, message=F, results=F, warning = FALSE}
library(tidyverse)
surveys_complete <- read_csv("data/portal_data_joined.csv") %>%
filter(complete.cases(.))
```
<br>
<div class = "blue">
### Learning Objectives
* Produce scatter plots, boxplots, and time series plots using ggplot.
* Describe what faceting is and apply faceting in ggplot.
* Set universal plot settings.
</div>
<br>
We start by loading the required packages. **`ggplot2`** is included in the **`tidyverse`** package.
```{r load-package, message=FALSE, purl=FALSE}
library(tidyverse)
```
Let's read in our surveys data, but filter it to keep only the rows where ALL the data are present, also known as "complete cases". We're also showing you a new little trick: using a period with a pipe. Normally, a pipe sends whatever is on its left into the FIRST argument of the function on its right. Sometimes, though, we want the left-hand side to go somewhere else as well. Here, we want `complete.cases()` to run on the whole dataset, so we put a period inside it to tell the pipe to send the left-hand side there too. You can think of the period as the target for the pipe.
```{r load-data, eval=FALSE, purl=FALSE}
surveys_complete <- read_csv("data/portal_data_joined.csv") %>%
filter(complete.cases(.))
```
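To see what the period does, here is a minimal sketch on a made-up data frame. The `toy` table and its values are hypothetical and only for illustration; the piped version and the version without a pipe should keep the same rows.

```r
library(dplyr)

# Hypothetical toy data with one missing weight
toy <- tibble(
  species_id = c("NL", "PF", "DM"),
  weight     = c(120, NA, 42)
)

# The period stands for the data frame arriving through the pipe,
# so complete.cases() is evaluated on the whole table
toy_complete <- toy %>%
  filter(complete.cases(.))

# The same filtering written without the pipe
toy_complete_explicit <- filter(toy, complete.cases(toy))
```

Both objects contain only the rows of `toy` with no missing values.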
## Plotting with **`ggplot2`**
**`ggplot2`** is a plotting package that makes it simple to create complex plots
from data in a data frame. It provides a more programmatic interface for
specifying what variables to plot, how they are displayed, and general visual
properties. Therefore, we only need minimal changes if the underlying data change
or if we decide to change from a bar plot to a scatterplot. This helps in creating
publication quality plots with minimal amounts of adjustments and tweaking.
**`ggplot2`** works best with data in the 'long' format, i.e., a column for every variable
and a row for every observation. Well-structured data will save you lots of time
when making figures with **`ggplot2`**.
ggplot graphics are built step by step by adding new elements. Adding layers in
this fashion allows for extensive flexibility and customization of plots.
To build a ggplot, we will use the following basic template that can be used for different types of plots:
```
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
```
- use the `ggplot()` function and bind the plot to a specific data frame using the
`data` argument
```{r, eval=FALSE, purl=FALSE}
ggplot(data = surveys_complete)
```
- define a mapping (using the aesthetic (`aes`) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.
```{r, eval=FALSE, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length))
```
- add 'geoms' – graphical representations of the data in the plot (points,
lines, bars). **`ggplot2`** offers many different geoms; we will use some
common ones today, including:
* `geom_point()` for scatter plots, dot plots, etc.
* `geom_boxplot()` for, well, boxplots!
* `geom_line()` for trend lines, time series, etc.
To add a geom to the plot use the `+` operator. Because we have two continuous variables,
let's use `geom_point()` first:
```{r first-ggplot, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point()
```
The `+` in the **`ggplot2`** package is particularly useful because it allows you
to modify existing `ggplot` objects. This means you can easily set up plot
templates and conveniently explore different types of plots, so the above
plot can also be generated with code like this:
```{r, first-ggplot-with-plus, eval=FALSE, purl=FALSE}
# Assign plot to a variable
surveys_plot <- ggplot(data = surveys_complete,
mapping = aes(x = weight, y = hindfoot_length))
# Draw the plot
surveys_plot +
geom_point()
```
**Notes**
- Anything you put in the `ggplot()` function can be seen by any geom layers
that you add (i.e., these are universal plot settings). This includes the x- and
y-axis mapping you set up in `aes()`.
- You can also specify mappings for a given geom independently of the
mappings defined globally in the `ggplot()` function.
- The `+` sign used to add new layers must be placed at the end of the line containing
the *previous* layer. If, instead, the `+` sign is added at the beginning of the line
containing the new layer, **`ggplot2`** will not add the new layer and will return an
error message.
```{r, ggplot-with-plus-position, eval=FALSE, purl=FALSE}
# This is the correct syntax for adding layers
surveys_plot +
geom_point()
# This will not add the new layer and will return an error message
surveys_plot
+ geom_point()
```
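The notes above also mention that a geom can carry its own mapping. Here is a minimal sketch, assuming a small made-up data frame (`toy` is hypothetical), where `x` and `y` are set globally and `color` is mapped only in the point layer:

```r
library(ggplot2)

# Hypothetical toy data for illustration only
toy <- data.frame(
  weight          = c(10, 20, 30, 40),
  hindfoot_length = c(15, 18, 30, 33),
  species_id      = c("PF", "PF", "NL", "NL")
)

# x and y are set in ggplot(), so every layer can see them
base_plot <- ggplot(data = toy, mapping = aes(x = weight, y = hindfoot_length))

# color is mapped only inside geom_point(); geom_line() does not use it
layered_plot <- base_plot +
  geom_line() +
  geom_point(aes(color = species_id))

layered_plot
```

In this sketch the line is drawn in a single default color, while the points are colored by species.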
## Building your plots iteratively
Building plots with **`ggplot2`** is typically an iterative process. We start by
defining the dataset we'll use, lay out the axes, and choose a geom:
```{r create-ggplot-object, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point()
```
Then, we start modifying this plot to extract more information from it. For
instance, we can add transparency (`alpha`) to avoid overplotting:
```{r adding-transparency, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1)
```
We can also add colors for all the points:
```{r adding-colors, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, color = "blue")
```
Or, to color each species in the plot differently, you can map a variable to the **color** argument inside `aes()`. **`ggplot2`** will assign a different color to each distinct value of that variable. Here is an example where we color by **`species_id`**:
```{r color-by-species, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, aes(color = species_id))
```
We can also specify the colors directly inside the mapping provided in the `ggplot()` function. A color mapping set there is seen by all geom layers, just like the x- and y-axis mappings set up in `aes()`.
```{r color-by-species2, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length, color = species_id)) +
geom_point(alpha = 0.1)
```
Notice that we can change the geom layer and the colors will still be determined by **`species_id`**.
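As a minimal sketch of that point, assuming a small made-up data frame (`toy` is hypothetical), the same base plot can be paired with different geoms and the color mapping carries over:

```r
library(ggplot2)

# Hypothetical toy data for illustration only
toy <- data.frame(
  weight          = c(10, 20, 30, 40),
  hindfoot_length = c(15, 18, 30, 33),
  species_id      = c("PF", "PF", "NL", "NL")
)

# color is mapped globally, so any geom added later inherits it
p <- ggplot(data = toy,
            mapping = aes(x = weight, y = hindfoot_length, color = species_id))

p_points <- p + geom_point()
p_jitter <- p + geom_jitter(alpha = 0.5)

p_jitter
```

Neither geom call mentions `species_id`, yet both plots are colored by it.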
<div class = "blue">
### Challenge
Use `ggplot()` to create a scatter plot of `weight` and
`species_id` with weight on the Y-axis, and species_id on the X-axis. Have the colors be coded by `plot_type`. Is this a good way to show this type of data? What might be a better graph?
<details>
<summary>ANSWER</summary>
```{r scatter-challenge, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_point(aes(color = plot_type))
```
</details>
</div>
<br>
## Boxplot
We can use boxplots to visualize the distribution of weight within each species:
```{r boxplot, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_boxplot()
```
By adding points to the boxplot, we can get a better idea of the number of
measurements and of their distribution.
Let's also use the geometry "jitter". `geom_jitter` is almost like `geom_point`, but it lets you see the density of points because it adds a small amount of random variation to the location of each point.
```{r boxplot-with-points, purl=FALSE}
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.3, color = "tomato") # notice that the color name needs to be in quotation marks
```
Notice how the boxplot layer is behind the jitter layer? What do you need to
change in the code to put the boxplot in front of the points such that it's not
hidden?
<div class = "blue">
### Challenges
1. Boxplots are useful summaries, but hide the *shape* of the distribution. For
example, if the distribution is bimodal, we would not see it in a
boxplot. An alternative to the boxplot is the violin plot, where the shape
(of the density of points) is drawn.
- Replace the box plot code from above with a violin plot; see `geom_violin()`.
2. In many types of data, it is important to consider the *scale* of the
observations. For example, it may be worth changing the scale of the axis to
better distribute the observations in the space of the plot. Changing the scale
of the axes is done similarly to adding/modifying other components (i.e., by
incrementally adding commands). Try making these modifications:
- Use the violin plot you made in Q1 and adjust the weight to be on the log~10~ scale; see `scale_y_log10()`.
3. Make a new plot to explore the distribution of `hindfoot_length` just for species NL and PF using `geom_boxplot()`. Overlay a jitter/scatter plot of the hindfoot lengths of the two species behind the boxplots. Then, add an `aes()` argument to color the datapoints (but not the boxplots) according to the plot from which the sample was taken.
*Hint:* Check the class for `plot_id`. Consider changing the class of `plot_id` from integer to factor. Why does this change how R makes the graph?
<details>
<summary>ANSWER</summary>
```{r}
#1 + 2
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_violin(alpha = 0) +
scale_y_log10()
#3
surveys_complete %>%
filter(species_id == "NL" | species_id == "PF") %>%
ggplot(mapping = aes(x= species_id, y = hindfoot_length)) +
geom_jitter(alpha = 0.3, aes(color = as.factor(plot_id))) +
geom_boxplot()
```
</details>
</div>
<br>
## Plotting time series data
Let's calculate the number of records per year for each species. First we need
to group the data and count records within each group. We can quickly use the dplyr function `count` to do this. `count` is very similar to the function `tally` we have seen before, but it internally calls `group_by` before counting and `ungroup` afterwards.
```{r, purl=FALSE}
yearly_counts <- surveys_complete %>%
count(year, species_id)
```
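To make the relationship between `count` and `tally` concrete, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical). Both pipelines should give the same table of counts:

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(
  year       = c(1990, 1990, 1990, 1991, 1991),
  species_id = c("DM", "DM", "PF", "DM", "PF")
)

# Short form: count() groups, counts, and ungroups in one step
counts_short <- toy %>%
  count(year, species_id)

# Long form: the same steps written out explicitly
counts_long <- toy %>%
  group_by(year, species_id) %>%
  tally() %>%
  ungroup()

all.equal(counts_short, counts_long)
```

In both results the count is stored in a column named `n`, which is the column we plot on the y axis below.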
Time series data can be visualized as a line plot with years on the x axis and counts
on the y axis:
```{r first-time-series, purl=FALSE}
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
geom_line()
```
Unfortunately, this does not work because we plotted data for all the species
together. We need to tell ggplot to draw a line for each species by modifying
the aesthetic function to include `group = species_id`:
```{r time-series-by-species, purl=FALSE}
ggplot(data = yearly_counts, mapping = aes(x = year, y = n, group = species_id)) +
geom_line()
```
We will be able to distinguish species in the plot if we add colors (using `color` also automatically groups the data):
```{r time-series-with-colors, purl=FALSE}
ggplot(data = yearly_counts, mapping = aes(x = year, y = n, color = species_id)) +
geom_line()
```
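One way to check that `color` really does group the data is to look at the `group` column that **`ggplot2`** computes for a layer. This is a minimal sketch on made-up counts (`toy_counts` is hypothetical), using `layer_data()` to inspect the first layer of each plot:

```r
library(ggplot2)

# Hypothetical yearly counts for two species
toy_counts <- data.frame(
  year       = rep(c(1990, 1991, 1992), times = 2),
  species_id = rep(c("DM", "PF"), each = 3),
  n          = c(5, 7, 6, 2, 3, 4)
)

ungrouped <- ggplot(data = toy_counts, mapping = aes(x = year, y = n)) +
  geom_line()

colored <- ggplot(data = toy_counts,
                  mapping = aes(x = year, y = n, color = species_id)) +
  geom_line()

# Number of distinct groups ggplot2 uses when drawing each line layer
n_groups_ungrouped <- length(unique(layer_data(ungrouped)$group))
n_groups_colored   <- length(unique(layer_data(colored)$group))
```

With these toy data, the plot without a color mapping puts every row in a single group, while the colored plot gets one group per species.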
## Faceting
**`ggplot2`** has a special technique called *faceting* that allows the user to split one
plot into multiple plots based on a factor included in the dataset. We will use it to
make a time series plot for each species:
```{r first-facet, purl=FALSE}
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
geom_line() +
facet_wrap(~ species_id)
```
<div class = "blue">
### Challenge
You are looking at a new dataset shared with you by a collaborator. You received the dataset shortly after the vernal equinox. Your collaborator didn't really give you any context on what the data represent, and you need to do some preliminary visualizations before you can really even formulate a question for them. Import the mystery dataset using:
```{r}
mystery <- read_csv("https://raw.githubusercontent.com/gge-ucd/R-DAVIS/master/data/mysteryData.csv")
```
Can you figure out what this dataset represents?
*Hint* Use your new knowledge of faceting to break up the data into groups, and consider changing the size and transparency of your `geom_`'s to get a better look!
<details>
<summary>ANSWER</summary>
```{r}
# Preview the data
mystery %>%
head(5)
# Plot the data
ggplot(data = mystery, mapping = aes(x = x, y = y)) +
facet_wrap(~ Group) +
geom_point(size = 0.1, alpha = 0.01)
```
If all went well, and you faceted by group, set points to be very small, and transparency to be very high (i.e., a low `alpha` setting), you should discover that each group is the outline of a different animal! You may notice that there is some distortion going on, and our foxy friend in Group B appears to have some thick thighs. Try equalizing the coordinate space in the x- and y-axes by adding a `coord_equal()` to your `ggplot()` call:
```{r}
## Equalize coordinate mapping
ggplot(data = mystery, mapping = aes(x = x, y = y)) +
facet_wrap(~ Group) +
geom_point(size = 0.1, alpha = 0.01) +
coord_equal()
```
This challenge was inspired by Trevor Branch's [Coelacanth post on Twitter](https://twitter.com/TrevorABranch/status/1227493205337182209) and adapted by [Christian John](https://jepsonnomad.github.io/).
</details>
</div>
<br>
## **`ggplot2`** themes
Themes are an easy addition that can make all your plots more readable (and a lot prettier!).
Plots with a white background usually look more readable when printed. We can set
the background to white using the function `theme_bw()`. **`ggplot2`** also
comes with several other themes that can be useful to quickly change the look
of your visualization. The complete list of themes is available
at <http://docs.ggplot2.org/current/ggtheme.html>. `theme_minimal()` and
`theme_light()` are popular, and `theme_void()` can be useful as a starting
point to create a new hand-crafted theme.
Here we apply `theme_bw()` and additionally remove the grid:
```{r facet-by-species-and-sex-white-bg, purl=FALSE}
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw() +
theme(panel.grid = element_blank())
```
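Because a theme is just another component added with `+`, trying a different one only takes a one-line change. Here is a minimal sketch on made-up counts (`toy_counts` is hypothetical) that reuses one base plot with two of the themes mentioned above:

```r
library(ggplot2)

# Hypothetical yearly counts for two species
toy_counts <- data.frame(
  year       = rep(c(1990, 1991, 1992), times = 2),
  species_id = rep(c("DM", "PF"), each = 3),
  n          = c(5, 7, 6, 2, 3, 4)
)

# Build the plot once, without a theme
base_plot <- ggplot(data = toy_counts, mapping = aes(x = year, y = n)) +
  geom_line() +
  facet_wrap(~ species_id)

# Swap themes by adding a different theme function
minimal_plot <- base_plot + theme_minimal()
light_plot   <- base_plot + theme_light()

minimal_plot
```

Storing the unthemed plot in a variable makes it easy to compare several themes side by side before settling on one.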
<div class = "blue">
### Challenge 1
Let's make one final change to our facet wrapped plot of our yearly count data. What if we wanted to split the counts of species up by sex where the lines for each sex are different colors? Make sure you have a nice theme on your graph too!
*Hint* Make a new dataframe using the `count` function we learned earlier!
<details>
<summary>ANSWER</summary>
```{r}
#new data frame counting the number of each sex of each species
yearly_sex_counts <- surveys_complete %>%
count(year, species_id, sex)
#plot code
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw()
```
</details>
</div>
<br>
<div class = "blue">
### Challenge 2
Use what you just learned to create a plot that depicts how the average weight
of each species changes through the years.
<details>
<summary>ANSWER</summary>
```{r average-weight-time-series, purl=FALSE}
#create a new dataframe
yearly_weight <- surveys_complete %>%
group_by(year, species_id) %>%
summarize(avg_weight = mean(weight))
ggplot(data = yearly_weight, mapping = aes(x=year, y=avg_weight)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw()
```
</details>
</div>
<br>
This lesson was contributed by [Martha Zillig](https://github.com/MarthaWohlfeil).