-
Notifications
You must be signed in to change notification settings - Fork 13
/
lesson_11_data_viz_pt2.Rmd
306 lines (208 loc) · 15.4 KB
/
lesson_11_data_viz_pt2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
---
title: Data visualization best practices for publication and accessibility
---
<br>
<div class = "blue">
## Learning Objectives
* Learn some basic visualization dos and don'ts
* Learn how to customize color in plots
* Gain a basic understanding of package `cowplot`
* Introduction to interactive `plotly` package
</div>
<br>
## Data Visualization in R
There are many tips and tricks that are available for the multitude of visualization packages in R. However, there aren't as many simple rules or suggestions on what actually makes a good visualization. This starts with the "grammar of graphics", which is the fundamental rules or principals which describe an art or science (from Wickham 2010).
> "A good grammar will allow us to gain insight into the composition of complicated graphics, and reveal unexpected connections between seemingly different graphics (Cox 1978)"
Because there are so many options and methods to plot our data in R, we need to think about how we are going to represent the data, how can that data be interpreted visually, and what story it may tell.
A very nice example of this is provided by this animation (created by Darkhorse Analytics, and used in Jenny Bryan's excellent [stat545](http://stat545.com/block015_graph-dos-donts.html) course). It shows how *simplification* can make a big difference in communication.
![Less is more graphic](img/less-is-more-darkhorse-analytics.gif)
### Exploration vs. Communication
One thing to consider is what the objective is when creating a visualization or plot. When we build plots for exploratory purposes, we already know what the variables are we are using, and the objective is more about what sort of patterns the data might show. When communicating, the objective is more about providing a stand-alone snapshot which helps others *understand* what you are trying to convey. We'll cover more on this when we talk about `Rmarkdown`.
### `ggplot2`
While there are many visualization options in R, we believe the most comprehensive and powerful is the `ggplot2` package. Much of the class has used/follows `ggplot`, so here's a little background that might be useful.
- Based on Grammar of Graphics book by Leland Wilkinson hence ‘gg’
- Each part of the plot is layered or built upon the other parts (like building legos).
- Consider parts of a `ggplot2` as parts of a house.
- Data = The materials the house is built from (`ggplot(data=yourdata)`)
- Plot Type = The structure/design of your house (how will it look?) (`geom_`)
- Aesthetics = What the exterior looks like, i.e., the paint/decor (`aes()`)
- Stats = Ways to wire or plumb your house...how to tie your data together, or transform it (`stat_`)
A questions we often get asked when helping students with ggplot figures is, *how can I put two y-axis on my graph?* Short answer, with ggplot, you can't. There is a good reason for this though, which is basically that the creators of ggplot2 believe that duel y-axis are confusing, eaisly misunderstood and not invertible (given a point on the plot space, you can not uniquely map it back to a point in the data space). We agree with this logic, and would urge you to never put two differnt y-axis on the same plot. Instead, make two seperate graphs.
## Color
There are so many options in R. It is fun to play around with color, but keep in mind not everyone sees color in the same way, and some folks cannot see certain spectrums of color (i.e., [absence of blue or green receptors is common](https://www.colormatters.com/color-and-vision/what-is-color-blindness)). See below for a example of what colorblindness may do...if you see numbers inside these circles, great, you have some blue/green retinal receptors.
![](https://www.colormatters.com/images/cbtests/plate1.jpg) ![](https://www.colormatters.com/images/cbtests/plate3.jpg)
Having said that, [here's a great cheat sheet of colors in R](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf), it can be handy when trying to find the correct color or name of a color. Check out this recent Nature perspective piece on [The misuse of colour in science communication](https://www.nature.com/articles/s41467-020-19160-7?utm_source=twitter&utm_medium=social&utm_content=organic&utm_campaign=NGMT_USG_JC01_GL_NRJournals).
### Discrete/Categorical:
For a very nice discussion about color palettes, I recommend [this page](http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/) from the R cookbook folks.
Additionally, check out the great [ggthemes](https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html) package, which has many options. One I find very helpful is using `scale_color_colorblind()`, which if you have 8-9 categories, may be a nice way to display your data.
### Continuous: [`viridis`](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)
The `viridis` package is an excellent set of colors that better represent your data, are easier to read for those with colorblindness, and they also tend to print fairly well in grayscale.
Take a look at the [vignette online](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)!
An example:
![viridis](img/viridis_demo.png)
### Customizing color pallettes in ggplot
The viridis color pallettes have been built into ggplot so that you can call upon them using the scale_fill_viridis_ or scale_color_viridis_ functions. Note that whether or not you use the fill or color scale function depends on which aesthetic you set in your plot. The functions also have extenstions depending on the kind of variable that you want to color: scale_fill_viridis_*d* is for discrete variables while scale_fill_viridis_*c* is for continous variables. Within the function, you can specify which virids pallette to use: A-D.
In this example we fill in our bars with the cut vairable in the diamond dataset, a categorical variable with 5 groups.
```{r, echo=T, eval=T}
library(ggplot2)
ggplot(diamonds, aes(x = clarity, fill = cut)) +
geom_bar() +
theme(axis.text.x = element_text(angle=70, vjust=0.5)) +
scale_fill_viridis_d(option = "C") +
theme_classic()
```
## Visualization Tips {.tabset .tabset-fade}
This information isn't meant to be comprehensive, but at minimum, it may provide some guidance when you are creating plots and figures.
### Visualization Do's
The most basic tip is keep it simple! Stick with a clean and clear message, what is your plot/figure trying to get across? Data visualization is effective when it is simple, and repackages data into a visual story that is easy to understand.
- Label appropriately and legibly, including axes, and use text to highlight important bits
- Use one color to represent each category, consider colorblind/BW friendly palettes
- Order datasets using logical heirarchy (Make it easy for reader to compare values)
- Use icons when possible to reduce unnecessary labeling
- Pay attention to scale (e.g., start axis at zero not 2.4 to 3.5)
- Include your data/outliers where possible
### Visualization Don'ts
A few things to avoid (which basically relates to keeping it simple):
- Don't try to add too much into one plot...**keep it simple**
- Don't add color uncessarily unless it provides a specific function
- Avoid high contrast colors (red/green or blue/yellow)
- Don't use 3D charts. They can make it hard to discern or perceive the actual information.
- Avoid ornamentation (shadowing, extra illustration, etc)
- Avoid more than 6 categorical colors in a layout unless you looking at continuous data.
- Keep fonts simple (avoid uncessary **bold** or *italicization*)
- Don't try to compare too many categories or data types in one chart
### Examples
#### Scatterplots
One of the best simple plots for examining patterns in data, but very effective. Also used when adding model trend lines.
```{r scatterplots}
suppressPackageStartupMessages(library(ggplot2))
plot(x=iris$Petal.Width) # single variable
plot(x=iris$Petal.Width, y=iris$Petal.Length) # multiple variables
ggplot() + geom_point(data=iris, aes(x=Petal.Width, y=Petal.Length))
ggplot() + geom_point(data=iris, aes(x=Petal.Width, y=Petal.Length, fill=Species), pch=21, size=3, alpha=0.5)
```
#### Lineplots
Comparing relative change in quantities across a variable like time. Note the change when we avoid facetting each line independently.
```{r lineplots}
plot(EuStockMarkets)
suppressPackageStartupMessages(library(tidyverse))
EuStockMarkets_df <- data.frame(as.matrix(EuStockMarkets), date=as.numeric(time(EuStockMarkets)))
EuStockMarkets_long <- gather(data = EuStockMarkets_df, key = "Market", value="value", 1:4)
ggplot() + geom_line(data=EuStockMarkets_long, aes(x=date, y=value, color=Market))
```
#### Barplots
Comparing totals across multiple groups. Notice legibility when you stack the bars.
```{r barplots}
# code adpated from https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/
suppressPackageStartupMessages(library(viridis))
barplot(iris$Petal.Length)
barplot(iris$Sepal.Length,col = viridis(3, option = "A"))
barplot(table(iris$Species,iris$Sepal.Length),col = viridis(3, option = "A"))
```
<br>
<br>
## Non-visual data interaction
Discussions of visualization so far have taken for granted that visualization is accessible to everyone, but researchers and audiences alike are not all sighted. RStudio is behind on blind accessibility, but some packages can provide text descriptions and sonification/audification of plots to improve accessibility for non-visual data interaction.
The `BrailleR` package ([read more here](https://r-resources.massey.ac.nz/BrailleRInAction/)), has a `VI()` function that wraps around plots and provides a text-description output. We can use the VI wrapper to interact with our plots from a textual perspective and identify what information is missing. Such text can also be used as alt-text when publishing this material. What could we do to our plot to improve the text description?
```{r, echo=T, eval=T, warning= F, message = F}
#install.packages("BrailleR")
library(BrailleR)
barplot <- ggplot(diamonds, aes(x = clarity, fill = cut)) +
geom_bar() +
theme(axis.text.x = element_text(angle=70, vjust=0.5)) +
scale_fill_viridis_d(option = "C") +
theme_classic()
VI(barplot)
```
The `sonification` package's `sonify` function can be used to represent data in audio form, where the x-axis can span sound across time, so that the length of time a sound plays follows the data long the x-axis from left to right; the y-axis can be expressed as pitch, so that the pitch of the sound matches to the values of the data (lower value = lower pitch).
```{r, echo = F, eval = T, warning= F, message = F}
library(sonify)
library(seewave)
sonif <- sonify(iris$Petal.Width)
savewav(sonif, file = "sw.width.sound.wav")
```
One variable can be used as an input to hear the distribution. For instance this univariate plot of iris petal width visually looks like this:
```{r, echo=T, eval=T, warning= F, message = F}
plot(iris$Petal.Width)
```
Using `sonify` instead of plot, the distribution of iris petal width sounds like this:
```{r, echo = T, eval = T, warning= F, message = F}
#install.packages("sonify")
library(sonify)
sonify(iris$Petal.Width)
```
If you try this on your computer it should play autmatically, buy on the website you can listen below:
<audio controls>
<source src="sw.width.sound.wav" type="audio/wav">
</audio>
```{r, echo = F, eval = T, warning= F, message = F}
sonif <- sonify(x = iris$Petal.Width, y = iris$Petal.Length)
savewav(sonif, file = "sw.width.length.sound.wav")
```
Two variables can be used as input to hear the relationship between continuous variables. For instance this bivariate plot of iris petal length and petal width visually looks like this:
```{r, echo=T, eval=T, warning= F, message = F}
plot(x = iris$Petal.Width, y = iris$Petal.Length)
```
And using `sonify`, sounds like this:
```{r, echo=T, eval=T, warning= F, message = F}
sonify(x=iris$Petal.Width, y = iris$Petal.Length)
```
<audio controls>
<source src="sw.width.length.sound.wav" type="audio/wav">
</audio>
## Publishing Plots: `cowplot`
A few excellent options exist for creating multi-paneled plots for publications. The first and foremost is a package by Claus Wilke called [`cowplot`](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html). With `cowplot`, it's possible to quickly combine existing `ggplots`, creating publication quality plots. See the vignette for more options, but here's a quick example from the vignette below:
```{r, echo F, warning = F, message = F}
detach("package:BrailleR", unload=TRUE)
```
```{r cowplot, echo=T, eval=T}
library(cowplot)
# make a few plots:
plot.diamonds <- ggplot(diamonds, aes(clarity, fill = cut)) +
geom_bar() +
theme(axis.text.x = element_text(angle=70, vjust=0.5))
#plot.diamonds
plot.cars <- ggplot(mpg, aes(x = cty, y = hwy, colour = factor(cyl))) +
geom_point(size = 2.5)
#plot.cars
plot.iris <- ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length, fill=Species)) +
geom_point(size=3, alpha=0.7, shape=21)
#plot.iris
# use plot_grid
panel_plot <- plot_grid(plot.cars, plot.iris, plot.diamonds, labels=c("A", "B", "C"), ncol=2, nrow = 2)
panel_plot
# fix the sizes draw_plot
fixed_gridplot <- ggdraw() + draw_plot(plot.iris, x = 0, y = 0, width = 1, height = 0.5) +
draw_plot(plot.cars, x=0, y=.5, width=0.5, height = 0.5) +
draw_plot(plot.diamonds, x=0.5, y=0.5, width=0.5, height = 0.5) +
draw_plot_label(label = c("A","B","C"), x = c(0, 0.5, 0), y = c(1, 1, 0.5))
fixed_gridplot
```
## Saving Figures and Plots
A plot you created with ggplot or another plotting package can be saved as .JPEGS (or .tiff, .img, etc) onto you. For any ggplot objects, we recommend using `ggsave`.
First, let's create a new folder in this project called `figures`. Let's save all the figures we create to that folder. `ggsave` will default to saving the last plot you created, however, we think it is always a good idea to specify exactly which plot you want saved. To do that, we have to save our plot as an object.
```{r, eval = F}
ggsave("figures/gridplot.png", fixed_gridplot)
```
#### With `ggsave` you can save images as
- .png, .jpeg, .tiff, .pdf, .bmp, or .svg
#### Other arguments of `ggsave`
- `scale` can scale the image (multiplicative scaling factor)
- `width` and `height` let you specify the size of the image in `units` that you specify
- `dpi` can change the quality of the image; for publication graphs we suggest over 700 dpi
```{r, eval = F}
ggsave("figures/gridplot.png", fixed_gridplot, width = 6, height = 4, units = "in", dpi = 700)
```
## The package `plotly`
The package `plotly` is an excellent, easy to use resource that allows you to quickly create interactive, web-based figures. We have used plotly to illustrate patterns in data to collaborators, to visualize patterns in our data during the pre-analysis stage, and even impress with "fancy" graphics during a conference talk!
Let's try plotly together using some of the `iris` data:
```{r, warning = F, message =F}
library(plotly)
plot.iris <- ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length, fill=Species)) +
geom_point(size=3, alpha=0.7, shape=21)
plotly::ggplotly(plot.iris) #it's as simple as that!
```
This lesson is adapted from the Software Carpentry: R for Reproducible
Scientific Analysis [Vectors and Data Frames materials](http://data-lessons.github.io/gapminder-R/03-data-types-subsetting.html)
and the Data Carpentry: R for Data Analysis and Visualization of Ecological Data
[Exporting Data materials](http://www.datacarpentry.org/R-ecology-lesson/03-dplyr.html).