-
Notifications
You must be signed in to change notification settings - Fork 13
/
lesson_01_intro_r_rstudio.Rmd
578 lines (410 loc) · 19.7 KB
/
lesson_01_intro_r_rstudio.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
---
title: "Orientation to programming, R, and RStudio"
---
<br>
<div class = "blue">
### Learning objectives
* Appreciate why a researcher might want to write code and why R specifically
* Gain familiarity the RStudio IDE
* Use basic math functions in R
* Understand objects and how to assign to them
* Use comparison operations
* Understand errors, warnings, and messages
* To be able to seek help via `?` and Google
</div>
<br>
# Why programming?
Programming can make your science even better than it already is.
The basis of programming is that we write down instructions for the computer to
follow, and then we tell the computer to follow those instructions. We write, or
*code*, instructions in R because it is a common language that both the computer
and we can understand. We call the instructions *commands* and we tell the
computer to follow the instructions by *executing* (also called *running*) those
commands.
The benefits of programming parallel many of the cornerstones of science.
Programming makes your workflow:
- Precise and flexible
- Efficient
- Reproducible
- Transparent
# Why R?
R is a free, open-source programming language that is designed for data analysis
and statistics.
R also has a huge user-community and is highly extensible, with
thousands of packages that add extra functionality. Lots of researchers use R,
so it is also a common language between us and our colleagues. In short, for
many researchers, it is the best tool to organize,
visualize, and analyze data.
# Why RStudio?
RStudio is essentially an interface to work with R, with lots of nice features and bells and whistles. It's like the difference between driving a really old car and a nice new one with GPS and Bluetooth and heated seats and power steering: they both use 4 wheels to get you where you need to go, but one will get you there in comfort and style, and maybe keep you safer along the way.
RStudio is an IDE (integrated development environment) which we use to manage
and execute R code. It is also free and open-source, it works on all platforms
(e.g. you can use an Amazon Web Services cluster using RStudio), and it
integrates version control and project management.
You write the same R code in RStudio as you would elsewhere, and it executes the
same way. RStudio helps by keeping things nicely organized.
# Introduction to RStudio
When you open RStudio you should see three panels:
1. The interactive R **Console** (entire left)
1. **Environment/History** (tabbed in upper right)
1. **Files/Plots/Packages/Help/Viewer** (tabbed in lower right)
The placement of these panes and their content can be customized (see menu,
Tools -> Global Options -> Pane Layout).
One of the advantages of using RStudio is that all the information
you need to write code is available in a single window. Additionally, with many
shortcuts, autocompletion, and highlighting for the major file types you use
while developing in R, RStudio will make typing easier and less error-prone.
## Workflow within Rstudio
Console vs. script
You can think of working in the Console vs. working in the Script as something like cooking. The console is like making up a new recipe, but not writing anything down. You can carry out a series of steps and produce a nice, beautiful dish at the end, but you didn't write anything down, so it's harder to figure out exactly what you did, and in what order. Writing a script is like taking nice notes while cooking- you can tweak and edit the recipe all you want, you can come back in 6 months and try it again, and you don't have to try to remember what went well and what didn't.
- Console
- The R console is where code is run/executed
- When you start RStudio, you'll see a bunch of information, followed by a
">" and a blinking cursor.
- You can type in commands here and, by pressing `Enter`, R will execute
those commands and print the result.
- You can work here, and your history is saved, but that is a laborious way
to work
- Script
- Preserve work in a plain-text file (with .R extension)
- Create new R script with `File -> New File -> R Script` or
'ctrl/cmd-shift-N'
- There's now a fourth RStudio panel, which is your plain-text script
- Do your work here, and save this to be able to reproduce or edit it at
a later date.
- For now your script is unsaved and called "Untitled1" or something.
We'll fix that shortly.
- `cmd/ctrl-enter` executes the line the cursor is on by copying that line
and sending it to the Console
- You can run multiple lines at once by highlighting them and pressing
`cmd/ctrl-enter`
- Benefits of working in a script:
- Mixes interactivity and preservation
- Save just text and can get same results at another time or on another
machine
- Building preservable pipeline of operations
<br>
<div class = "blue">
### Tip: Pushing to the interactive R console
To run the line of your script where the cursor is, you can click on the `Run` button at the top-right of the script pane or use the keyboard short cut: `cmd/ctrl-enter`.
To run a block of code, select (highlight) it and click `Run` or `cmd/ctrl-enter`.
You are working toward selecting a whole script and running it.
- You'll write your script interactively, running each line to make sure it works, and at the end, you'll be able to run the whole analysis by selecting all and running the script. This way you can later rerun the analysis on new or modified data or change part of the analysis and everything will work with the click of a button.
</div>
<br>
# R Projects
One of the biggest benefits in using RStudio is that you get to use R Projects, which are extremely helpful for keeping a given project neatly organized and consistent. An R Project is essentially a single folder/directory on your computer, containing all the files you need for that given project. It should be totally self-contained, with all your data, scripts, and results inside it. What makes it an "R Project" is a small .RProject file that sits inside the project folder, which tells RStudio that this folder is for a single R Project.
If you take a look in the very upper-right corner of the RStudio window, you should see a little cube icon with an R in it, possibly with "Project (None)" next to it. If you click on this, you'll see some options to create a New Project, Open Project, etc.
# Always Start Clean
By default, RStudio is set up so that every time you reopen a project, the environment is exactly as you left it. This might sound great, but it can actually get you into a bit of trouble! Think back to our recipe analogy for a second. You want to start the recipe fresh each time you make the dish- if you were good about writing down the directions (writing an R script), then you should be able to recreate everything the same as before. However, RStudio bringing stuff back in can be more like leaving residue on pots and pans, messing with your nice start-to-finish recipe.
We're going to change the settings in RStudio so that you'll be working with a nice clean environment each time. As long as you save your scripts, you shouldn't lose any work, and you'll avoid a lot of unforseen problems when you open a project up a few months later. In the menu bar at the top of your computer, navigate to Tools, then Global Options. Find the box that says "Restore .RData into workspace at startup" and make sure it is **unchecked**.
<br>
<div class = "blue">
### Tip
While you're in Global Options, you can navigate to the Code tab on the lefthand side of the window. Find the box that says "Soft-wrap R source files" and make sure it is **checked**.
This will make long lines of code in an R script wrap around onto a new line so you don't have to do a huge horizontal scroll, which is super annoying.
</div>
<br>
# Introduction to R
The simplest thing you can do with R is do arithmetic:
```{r chunk1}
1 + 100
```
And R will print out the answer, with a preceding "[1]", which indicates the
first item of output.
If you type in an incomplete command, R will wait for you to complete it:
~~~ {.r}
> 1 -
~~~
~~~ {.output}
+
~~~
Any time you execute code and the R session shows a "+" instead of a ">", it
means it's waiting for you to complete the command. If you want to cancel
a command you can simply hit "Esc" and RStudio will give you back the ">"
prompt. You can also cancel commands with "Esc" if R is taking too long to
finish a calculation.
R can use order of operations, just like you learned way back in algebra.
```{r chunk2}
3 + 5 * 2
```
Use parentheses to group to force the order of evaluation, and/or to make
code easier to read.
```{r chunk3}
(3 + 5) * 2
```
## Whitespace
Speaking of being easy to read, whitespace is ignored by R. Use it consistently
to make code readable. For example, putting a single space on either side of an
operator makes code easy to read.
```{r chunk4, eval=FALSE}
(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2 # easier to read, once you know rules
3+5*2^2 # very hard to read
3 + 5 * (2 ^ 2) # to make order of operations clear, use parentheses
```
## Comments
The text that appears to the right of each line of code above is called a comment. Anything that
follows the hash symbol -- `#` -- is ignored by R.
Liberally add comments to your code as you write. Things that are clear as you write them will be
mysterious to others, including your-future-self! Commenting takes little time and will save you
time and headaches in the long run.
## Scientific Notation
Really small or large numbers get a scientific notation:
```{r chunk5}
2/10000
```
Which is shorthand for "multiplied by `10^XX`". So `2e-4`
is shorthand for `2 * 10^(-4)`.
You can write numbers in scientific notation too:
```{r chunk6}
1e9 # One billion
```
## Mathematical functions
R has many built in mathematical functions. To call a function,
type its name, follow by open and closing parentheses.
Anything we type inside those parentheses is an "argument" to that function.
Here we call the `sin` function and provide it the argument 3.14, or
approximately $\pi$.
```{r chunk7}
sin(3.14) # trigonometry functions
```
We can take a logarithm:
```{r chunk8}
log(3) # natural logarithm
```
Or exponentiate:
```{r chunk9}
exp(0.5) # e^(1/2)
```
## Nested Functions
You can even put functions inside each other. `exp(0.5)` raised `e` to the `1/2`
power. Equivalently we could take the square-root of `e`. Expressions are
interpretted from the inside-out: In the following line, R first takes `e^4`, and then takes the square-root (that's what the `sqrt` function does) of the result.
```{r chunk10}
sqrt(exp(4))
```
You don't need to remember function names. There are many ways to discover or
rediscover them when you need them. Google is your friend, but we will discuss
other ways soon.
## Comparison
We can do logical comparison in R. This will be important later, for example,
when we want to filter a dataset based on a logical condition.
```{r chunk11}
1 == 1 # equality (note two equals signs, read as "is equal to")
```
```{r chunk12}
1 != 2 # not-equal (read, "is not equal to")
```
```{r chunk13}
1 < 2 # less than
```
```{r chunk14}
1 >= -9 # greater than or equal to
```
## Objects and assignment
Now we're getting to the actual coding stuff. You can store values/data in "objects", also known as variables. An object has a name, and it has something stored in it, which could be a number, a string of characters, or even a whole dataset.
We can store values in objects using the assignment operator `<-`. You can also use a single equals sign, `=`, for assignment.
Note that unlike every other expression we have run so far, R doesn't print
anything when we run this next line. Instead, it is stored for later in a
**object**, `x`. `x` now contains the **value** `0.25`. Read this as
"Assign 1/4 to x."
```{r chunk15}
x <- 1/4
```
Look for the `Environment` tab in one of the panes of RStudio, and you will see
that `x` and its value have appeared. Our object `x` can be used in place of a
number in any calculation that expects a number:
```{r chunk16}
x
```
```{r chunk17}
log(x)
```
This doesn't change the value of `x` or store the result anywhere, it simply
prints the answer to the console.
Objects can be reassigned:
```{r chunk18}
x <- 99
```
`x` used to contain the value 0.25 and and now it has the value 99.
Assignment values can contain the object being assigned to:
```{r chunk19}
x <- x + 1
```
Finally, objects can have values assigned using other objects:
```{r}
y <- x*10
```
<br>
<div class = "blue">
### MCQ -- object Assignment
What does the following code print?
```
a <- 1
b <- 2
c <- a + b
b <- 4
a <- b
c <- a
c
```
```
Option 1) a
Option 2) 3
Option 3) 4
Option 4) ::nothing::
```
</div>
<br>
## Object name conventions
Object names can contain letters, numbers, underscores and periods. They
cannot start with a number nor contain spaces at all. Different people use
different conventions for long object names, especially:
- underscores\_between_words
- camelCaseToSeparateWords
What you use is up to you, but **be consistent**.
Use descriptive object names, as they make your code easier to understand. It
will save time because you'll remember what each object is: It's easier to
remember what `domesticPopulation` is than `dp` or `x`. A silly example:
```{r chunk20}
theNumberNine <- 9
```
## Tab completion
Tab-completion is a really nice feature of RStudio that saves typing and avoid
typos. After you assign 9 to `theNumberNine`, if you start typing `t...`,
`th...`, etc., and then pressing `tab`, RStudio will pull up a box of all the
valid ways to finish that word. You can scroll through them using the up- and
down-arrows and press enter to choose the one you want. If you press tab when
there is only one valid way to complete something, RStudio will automatically
complete it.
# Understanding functions & getting help
## R help files
Once you figure out what function you want, you need to figure out how to use
it. Every function has an associated help-file. They can be hard to read,
especially at first, but it is important to learn how to make sense of them.
`?function` brings up help-file. E.g.
```{r getting help, eval = FALSE}
?log
```
Each help-file contains the following components.
- Description: An extended description of what the function does.
- Usage: The arguments of the function and their default values.
- Arguments: An explanation of the data each argument is expecting.
- Details: Any important details to be aware of.
- Value: The data the function returns.
- See Also: Any related functions you might find useful.
- Examples: Some examples for how to use the function.
## Other ways to get help
- `??` searches the text of all R help files, e.g. `??base` will find `log`.
- [Google](https://www.google.com/)
- [Stack Overflow](https://stackoverflow.com/)
- [RStudio cheat sheets](http://www.rstudio.com/resources/cheatsheets/)
- [Cookbook for R](http://www.cookbook-r.com/)
## Arguments to functions
- Can be specified by order or by name
- Before, when we entered `log(3)`, `log` knew `3` was `x` because it was in the
first position, but we could have also told `log` explicitly that `3` is the
value `x` should take. These are the same:
```{r named-args}
log(3)
log(x = 3)
```
- Some arguments have default values, e.g. `log`'s `base` defaults to `exp(1)`,
*e*, unless you tell it otherwise. So these are identical:
```{r arg-defaults}
log(x = 3)
log(x = 3, base = exp(1))
```
To get the base 10 logarithm of 3, you could do
```{r changing-default-args}
log(x = 3, base = 10)
```
If you provide a function with arguments by name, they can go in any order.
Otherwise, they have to appear in the order specified by the function. These are
all the same:
```{r ordering-args}
log(3, 10)
log(x = 3, base = 10)
log(base = 10, x = 3)
```
Remember tab-completion? Well it **works with functions too**! Type the name of a function, followed by parentheses, and make sure your cursor is between the parentheses. If you hit tab, a window should pop up showing the names of the arguments, with little purple rectangles next to them.
```{r, eval = F}
log()
```
If you use your arrow keys, you can move around, and RStudio will display a little description of each argument. If you hit Enter, it will paste that argument and an equals sign in between the parentheses.
```{r, eval=F}
log(x = )
```
You can type in whatever you want for that argument, type a comma, then hit Tab again, and RStudio will bring up a list of all *remaining* arguments.
```{r, eval = F}
log(x = 3, )
```
<br>
<div class = "blue">
### MCQ -- Which of these things is not like the other ones?
Three of the following lines produce the same result. Without running the code, which one will produce a different result than the others? The helpfile for `log` (`?log`) may be helpful.
```
Option 1) log(x = 1000, base = 10)
Option 2) log10(1000)
Option 3) log(base = 10, x = 1000)
Option 4) log(10, 1000)
```
</div>
<br>
## When R Wants to Tell You Something
Besides the value of an expression R has executed, there are a few other kinds
of responses you might get from R, including errors, warnings, and messages.
### Errors
R returns an error when it cannot proceed. It stops you in your tracks. The
error message will provide some information on what the problem was, but it is
often cryptic. Learning to understand these messages is important but takes
practice. Here's an example of an error:
```{r Error1, error = TRUE}
log_of_a_word <- log("a_word")
```
R tell us that something has gone wrong: It got a non-number for a function that
needs a number. Note that errors prevent execution of the line, so nothing got
assigned to `log_of_a_word` there. If we ask R what it thinks `log_of_a_word`
is, it will return another error. Practice understanding R's communication
style: Do you understand how R is telling you what the problem is?
```{r Error2, error = TRUE}
log_of_a_word
```
### Warnings
Warnings appear in the same red font in the console, but they start with
"Warning" instead of "Error". Warnings are R's way of telling you that it did
something, but it suspects it may not have been what you wanted. *Warnings can
be more insidious than errors* because you can keep going, but keep going with a
mistake in your pipeline. Here's an example:
```{r Warning1, warning=TRUE}
log_of_a_negative <- log(-2)
```
`NaN` means "not a number", and R has kindly told us, "Hey, I think you probably
wanted a number here -- taking a log of a negative is kind of a weird thing to
do. I can do it if you really want, I just want to be make sure it's what you
want."
Note that it did work, so if we ask R what `log_of_a_negative` is, we won't get
an error. Note that we don't get a warning either, so you need to pay attention
when warnings first appear.
```{r Warning2, warning=TRUE}
log_of_a_negative
```
### Messages
There's a third source of red text in R: messages. These are R's way of telling
you that something happened, but it's probably nothing to worry about. These
don't start with "Message"; they just print the red text.
<br>
<div class = "blue">
### Challenge
Which elephant weighs more? Convert one's weight to the units of the other, and store the result in an appropriately-named new object. Write a bit of code to test whether elephant1 weights more than elephant2 (1 kg ≈ 2.2 lb).
```{r, eval=FALSE}
elephant1_kg <- 3492
elephant2_lb <- 7757
```
</div>
<br>
This lesson is adapted from the Software Carpentry: R for Reproducible
Scientific Analysis [Introduction to R and RStudio materials](http://data-lessons.github.io/gapminder-R/01-intro-r-rstudio.html)
and the Data Carpentry: R for data analysis and visualization of Ecological Data
[Before We Start materials](http://www.datacarpentry.org/R-ecology-lesson/00-before-we-start.html).