-
Notifications
You must be signed in to change notification settings - Fork 2
/
Intro to R.Rpres
executable file
·475 lines (412 loc) · 15.8 KB
/
Intro to R.Rpres
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
Intro to R
========================================================
author: Jeho Park, Stephen Park
date: August 4, 2018
autosize: true
Agenda
========================================================
- About R
- R Basics
- Working With Data
- Hands-On Project
- Further Learning
This R Workshop is...
========================================================
* for newbies.
- If you already have some experience in R, please raise your hands.
* about R's programming aspects.
- It's designed to help you start using and coding R.
* not about Statistics.
- I assume that you already have some basic knowledge of Statistics
* for you to start learning R afterwards.
- This workshop is designed to give you a nudge to learn more R down the road. There are many resources to learn R in depth.
Learning Objectives:
========================================================
_By the end of this this workshop, you will be able to:_
* Install R libraries/packages.
* Import and export data from/to a simple CSV file in R.
* Distinguish the differences between R data objects and convert them.
* Use subsetting methods to create subsamples.
* Create basic plots using both base plot in R.
* Create R markdown documents and R slides.
* Create tidy data and manipulate those data using dplyr package.
What is R?
========================================================
* R is a statical programming language/environment.
* R is open source and free.
* R is widely used.
* R is cross-platform.
* R is hard to learn (really?).
What is not R?
========================================================
* S: R's ancestor
* S-Plus: Commercial; modern implementation of S
* SAS: Commercial; widely used in the commercial analytics.
* STATA: Commercial; widely used by economists.
* SPSS: Commercial; easy to use; widely used in Social Science.
* MATLAB: Commercial; can do some Stats.
* Python: Also can do some Stats; one of the two mostly used programming languages in data science.
Then Why R?
========================================================
* __R community is active and constantly growing__
* R is one of the most popular stat programming lang
* R has tons of user generated libraries/packages
* R code is easily shared with others
* R is constantly improved
Then Why R?
========================================================
* R community is active and constantly growing
* __R is one of the most popular stat tools/programming languages__
* R has tons of user generated libraries/packages
* R code is easily shared with others
* R is constantly improved
###### Further reading about R's popularity in science and engineering:<br>
###### R moves up to 5th place in IEEE language rankings at https://www.r-bloggers.com/r-moves-up-to-5th-place-in-ieee-language-rankings/
###### https://spectrum.ieee.org/static/interactive-the-top-programming-languages-2017 (6th in 2017)
Then Why R?
========================================================
IEEE Spectrum: Popular Programming Languages in 2017
![Poplar Programming Languages](The_Top_Programming_Languages_2017_-_IEEE_Spectrum.png)
Then Why R?
========================================================
* R community is active and constantly growing
* R is one of the most popular stat tools/programming languages
* __R has tons of user generated libraries/packages__
* R code is easily shared with others
* R is constantly improved
Getting help online and offline
===============================
### On the Internet:
* Google! (Now Google knows "r" means the R programming language)
* Stack Overflow: http://stackoverflow.com/questions/tagged/r
* Cross-Validated: the statistics Q&A site http://stats.stackexchange.com/
* Rseek meta search engine: http://rseek.org/
* R-help listserv: https://www.r-project.org/mail.html
R Basics
========================================================
Let's get ready for R-ing!
Open your RStudio app.
Open your browser.
Getting Started - RConsole
========================================================
- Upon opening RStudio, you will see the Console
- ![RConsole](rconsole.png)
Getting Started - RConsole
========================================================
- You can perform some calculations by typing into it.
```{r}
2+2
```
Getting Started - Variables
========================================================
- You can also use variables and assign them values.
```{r}
x = 2
x+2
```
```{r}
# The "<-" operator is the same as "="
y <- x+1 # "<-" is preferred
x+y
```
Getting Started - R Environment
========================================================
- When you create a variable it will show up on the environment pane in RStudio
![Environment Pane](renvironment.png)
R Basics - Arithmetic Operators
========================================================
- Operators are symbols that use some input and *operate* on them.
- Arithmetic operators are our common mathematical symbols.
```{r echo='false'}
x*x # Multiplication
x/y # Division
x^3 # Exponentiation
```
R Basics - Logical Operators
========================================================
- Logical operators output either True or False
```{r}
2 < 5 # Less Than
x >= 2 # Greater Than Or Equal To
```
R Basics - Logical Operators
========================================================
```{r}
x == 1 # Equal to. NOTE: This is different from "="
x != x # Not Equal to
```
R Basics - Data Types
========================================================
- R can work with more than just numbers.
```{r}
class("hello, world!") # The class function returns the data type of a variable or value.
class(x == y)
```
R Basics - Vectors
========================================================
- Up until now, we have been working with single numbers and variables also known as scalars.
- In R, there are also vectors which are sequences of values.
```{r}
a <- c(1,2,3)
b <- c(3:1)
for(i in 1:3) {
print(a[i]+b[i])
}
```
R Basics - Vectors - 2
========================================================
- R's basic data object is "vector"
- Operators are optimized for vector (element-wise) operations
- Use vectorized operations than loops whenever possible
```{r}
a + b
a^2
```
R Basics - Built-In Functions
========================================================
- Functions take an input and produce an output.
- R contains many functions that are built-in, some of which we have already seen such as class() and c().
```{r}
abs(-5) # Absolute Value
sqrt(4) # Square Root
```
R Basics - Built-In Functions - 2
========================================================
```{r}
sum(a) # Sum
toupper("hello") # To Upper Case
```
R Basics - Built-In Functions - 3
========================================================
```{r}
sd(a) # Can you guess what this funtion calc?
cor(a,b) # How about this?
```
R Basics - Libraries and Packages
========================================================
- R is known for its community and its libraries/packages
- Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library.
- We will install the dplyr package, a popular package for data manipulation.
- ```
install.packages('dplyr')
```
```{r}
library('dplyr')
```
R Basics - Help
========================================================
- With all these packages and functions you may forget how to use one or need to remember which one to use.
```{r}
# This will search for usage of the hist function
?hist
```
```{r}
# This will search packages and functions about histograms
??histogram
```
Exercise 1
==============
- Review of some basic R features
- Open ./exercises/intro/exercise_1.Rmd (this is a R markdown file)
Working with Data
========================================================
Working with Data - DataFrames
========================================================
- A data frame is used for storing data tables. It is a list of vectors of equal length.
```{r}
names <- c("John", "Kim", "Terry")
ages <- c(21, 34, 16)
male <- c(TRUE, FALSE, TRUE)
data.frame(names, ages, male)
```
Working with Data - DataFrames - Datasets
========================================================
- R's datasets package has some built in datasets that we will be using.
- The CO2 data frame has 84 rows and 5 columns of data from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.
```{r}
help(CO2)
```
- You can use '$' to access a vector (column/feature/etc) of a dataframe
Working with Data - Data Exploration - Head
========================================================
```{r}
# The head function shows the first few entries in a dataframe.
head(CO2)
```
Working with Data - Data Exploration - Summary
========================================================
```{r}
# The summary function shows summary statistics.
summary(CO2)
```
Working with Data - Data Visualization
========================================================
```{r, eval=FALSE}
hist(CO2$uptake) # Use help function for more plotting options
```
Working with Data - Data Visualization - Boxplot
========================================================
```{r, eval=FALSE}
boxplot(CO2$conc)
```
Working with Data - Data Visualization - Scatterplot
========================================================
```{r, eval=FALSE}
plot(CO2$uptake, CO2$conc)
```
Exercise 2
==============
- Review of datasets and data visualization
- Open ./exercises/intro/exercise_2.Rmd
Working with Data - Data Subsseting (classic) - 1
=========================
Operators that can be used to extract subsets of R objects.
* '[' and ']' always returns an object of the same class as the original; can be used to select more than one element.
* '[[' and ']]' is used to extract elements of a list or a data frame; it can only be used to extract a single element.
* $ is used to extract elements of a list or data frame by name.
Working with Data - Data Subsetting (classic) - 2
==========================
```{r, eval=FALSE}
x <- c("a", "b", "c", "c", "d", "a")
x[1]
x[1:4]
x[x > "a"]
u <- x > "a" # what's u here?
u
x[u] # subsetting using a boolean vector
y <- list(foo=x, bar=x[u])
y
y[[1]]
y$bar
```
Working with Data - Data subsetting (classic) - 3
========================================================
- Create a dataframe containing the first 10 observations.
- Create a dataframe where only the Plant identifier is Qn3.
- Create a dataframe only containing observations with the CO2 concentration level greater than 600.
```{r, eval=FALSE}
co2_10obs <- CO2[<row>, <col>]
co2_Qn3 <- CO2[?, ?]
co2_gt600conc <- subset(CO2, conc > 600)
```
Working with Data - Data Manipulation (dplyr)
========================================================
- Dplyr is the most common package used for data exploration and transformation
- Dplyr functions: filter, select, arrange, mutate, summarise (plus group_by)
```{r}
head(filter(CO2, Treatment=='nonchilled'))
```
Working with Data - Data Manipulation (dplyr) - Select
========================================================
- ```select``` will output only the columns that you choose
```{r}
head(select(CO2, Plant, conc, uptake), 3)
```
Working with Data - Data Manipulation (dplyr) - Chaining
========================================================
- You can chain/pipe dplyr functions together
- The infix (or pipe) operator '%>%' will feed the resulting object into the 1st paramater of the next function
```{r}
x <- select(filter(CO2, Treatment=='nonchilled'), Plant, conc, uptake) # OR
y <- CO2 %>% filter(Treatment=='nonchilled') %>% select(Plant, conc, uptake)
x == y
```
Working with Data - Data Manipulation (dplyr) - Arrange
========================================================
- ```arrange()``` sorts rows by a column
```{r}
# Use desc() to sort descending
CO2 %>% arrange(desc(uptake)) %>% head(3)
```
Working with Data - Data Manipulation (dplyr) - Mutate
========================================================
- ```mutate()``` creates new variables
```{r}
CO2 %>% mutate(conc_L = conc / 1000) %>% head()
```
Working with Data - Data Manipulation - Summarise
========================================================
- ```group_by()``` and ```summarise()``` work together to aggregate variables
```{r}
CO2 %>% group_by(Type) %>% summarise(avg_uptake = mean(uptake))
```
Working with Data - Data Manipulation (dplyr) - Tips
========================================================
- In datasets with a lot of columns you can select all columns but selected ones by using the '-' operator ```select(CO2, -Plant)```
- When aggregating data you need an aggregate function such as mean, median, n (number of rows in a group), sum, etc.
- If your data has missing values, you need to add the parameter ```na.rm = TRUE```, this will skip any missing values.
- When working with data, the data manipulation process often takes more time than the analysis.
Exercise 3
==============
- Review of some data manipulation functions in dplyr
- Open ./exercises/intro/exercise_3.Rmd
Tydyverse
===================
Further Learning - Beginner/Intermediate
========================================================
- Beginner/Intermediate Topics
- User Defined Functions
- Control Structures
- Statistical Methods
- Data Cleaning
- Data Visualization Libraries (ggplot2, plotly)
- https://plot.ly/r/dashboard/
Further Learning - Advanced
========================================================
- Advanced Topics
- Deep Learning
- Functional Programming
- Object Oriented Programming
- Debugging Tools
- Performance
Hands-On Project
========================================================
Hands-On Project
========================================================
- In this workshop we have been using the console but when working on a project you will generally use the editor so you can save your scripts. This helps anyone reproduce your findings.
- When using a script, you must run your code either all at once (Run All) or Line-by-Line
Hands-On Project - Reading the Data
========================================================
- Download the data to your computer at http://bit.ly/r-hands-on-s2018
- You can import the data to R either by code or by clicking Import Dataset on the Environment Tab
```{r}
# You need to put the full path to the file in quotes or change the working directory [getwd(), setwd()]
Births <- read.csv("Births2015.csv")
Auto <- read.csv("Auto.csv")
```
Hands-On Project - Analysis
========================================================
1. What is the name of the 1st car in the dataset?
2. What is the total number of babies born in 2015?
3. Make a boxplot of car mpg.
4. Make a histogram of number of births.
5. How many babies are born on Wednesdays?
6. Which date had the least amount of babies born?
7. Is there a relationship between the number of cylinders and mpg?
8. Plot the average mpg for cars of each number of cylinders?
Hands-On Project - Answers
========================================================
1. ```head(Auto,1)```: chevrolet chevelle malibu
2. ```sum(Births$births)```: 3978497
3. ```boxplot(Auto$mpg)```
4. ```hist(Births$births)```
5. ```Births %>% group_by(wday) %>% summarise(sum=sum(births)) %>% arrange(sum)```: 638513
6. ```arrange(Births, births)```: 12/25/2015
7. ```cor(Auto$mpg, Auto$cylinders)``` or ```plot(Auto$cylinders, Auto$mpg)```
Hands-On Project - Answer #8
========================================================
```{r}
x <- Auto %>% group_by(cylinders) %>% summarise(mean(mpg))
plot(x$cylinders, x$`mean(mpg)`)
```
Hands-On Project - Answer #8 - 2
========================================================
```{r}
plot(x$cylinders, x$`mean(mpg)`, type='o', xlab="Cylinders", ylab="Average MPG", main="Cylinders vs. Average MPG") # Cleaner Version
```
Resources
========================================================
- Swirl (http://swirlstats.com/)
- DataCamp (https://www.datacamp.com/courses/tech:r)
- R for Data Science by Hadley Wickham (http://r4ds.had.co.nz/) [FREE]
- Statistics.com (https://www.statistics.com/landing-page/r-courses/)