---
title: "Exercises for 2.2"
output: html_document
---
```{r setup, include=FALSE}
Sys.setenv(PATH=paste('/home/leo/apps/miniconda3/envs/pandas023test/bin:/home/leo/apps/miniconda3/bin', Sys.getenv('PATH'), sep = ':'))
library(reticulate)
use_condaenv(condaenv = 'pandas023test', conda = "/home/leo/apps/miniconda3/bin/conda", required = TRUE)
```
# Q1
List five functions that you could use to get more information
about the *mpg* dataset:
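```{r}
library(ggplot2)
library(dplyr)
names(mpg)       # column names
colnames(mpg)    # same as names() for a data frame
str(mpg)         # compact structure: types and first values
dim(mpg)         # number of rows and columns
nrow(mpg)
ncol(mpg)
head(mpg)        # first six rows
summary(mpg)     # per-column summary statistics
glimpse(mpg)     # tidyverse counterpart of str()
sample_n(mpg, 5) # five random rows
```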
# Q2
How can you find out what other datasets are included with ggplot2?
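```{r}
# Printing data(package = 'ggplot2') directly shows the listing in a pager,
# which may not render when knitting; print the names and titles instead.
data(package = 'ggplot2')$results[, c("Item", "Title")]
```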
# Q3
Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance travelled with fixed amount of fuel).
How could you convert cty and hwy into the European standard of l/100km?
```{r}
# Convert fuel economy (miles per US gallon) to fuel consumption (l/100km)
fuel.economy.to.fuel.consump <- function(fuel.economy) {
  miles.per.liter <- fuel.economy / 3.78541  # 1 US gallon = 3.78541 l
  km.per.liter <- miles.per.liter * 1.60934  # 1 mile = 1.60934 km
  liter.per.km <- 1 / km.per.liter
  100 * liter.per.km
}
mpg$cty.consump <- fuel.economy.to.fuel.consump(mpg$cty)
mpg$hwy.consump <- fuel.economy.to.fuel.consump(mpg$hwy)
head(mpg, 3)
```
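The conversion can be checked by hand. Using the same constants as the code (1 US gallon = 3.78541 l, 1 mile = 1.60934 km), the steps collapse into a single division:

$$
\text{l/100km} = \frac{100 \times 3.78541}{1.60934 \times \text{mpg}} \approx \frac{235.21}{\text{mpg}}
$$

For example, 18 mpg corresponds to roughly $235.21 / 18 \approx 13.1$ l/100km.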
# Q4
Which manufacturer has the most models in this dataset? Which model has the most variations?
Does your answer change if you remove the redundant specification of drive train (e.g. “pathfinder 4wd”, “a4 quattro”) from the model name?
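```{r}
# Rows per manufacturer (the long form and the count() shorthand are equivalent)
mpg %>%
  group_by(manufacturer) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
mpg %>% count(manufacturer, sort = TRUE)
# Distinct model names per manufacturer
mpg %>% distinct(manufacturer, model) %>% count(manufacturer, sort = TRUE)
# Rows (variations) per model
mpg %>% count(model, sort = TRUE)
```
Counted by rows, dodge has the most entries (37). Counted by distinct model names, toyota has the most (6).
caravan 2wd has the most variations (11 rows).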
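The second part of the question, removing the drive-train suffix, is not worked out above. Here is a minimal sketch of the clean-up step, written in Python to match the pandas comparison further down. The model names are a small hypothetical sample (not the real mpg rows), and the token list (`2wd`, `4wd`, `awd`, `quattro`) is an assumption about which suffixes count as drive-train markers.

```python
import re
from collections import Counter

# Assumed drive-train tokens; extend this pattern if other markers appear.
DRIVE_TRAIN = re.compile(r"\s+(?:2wd|4wd|awd|quattro)$")


def strip_drive_train(model: str) -> str:
    """Remove a trailing drive-train token such as ' 4wd' or ' quattro'."""
    return DRIVE_TRAIN.sub("", model)


# Hypothetical sample of model names, not the real mpg rows.
models = ["a4", "a4 quattro", "a4 quattro", "pathfinder 4wd", "caravan 2wd"]

cleaned = [strip_drive_train(m) for m in models]
counts = Counter(cleaned)
print(counts.most_common())
```

In this sample, `a4` and `a4 quattro` merge into a single model once the suffix is gone, which is how the ranking could shift on the full dataset.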
**Note**:
*dplyr*'s `group_by()` function behaves differently from its pandas (Python) and Spark equivalents.
Its return value is still a data frame (a grouped tibble):
```{r}
# group_by() only adds grouping metadata; the result is still a data frame
grouped <- mpg %>% group_by(manufacturer)
class(grouped)
head(grouped, 3)
```
In pandas, by contrast, `DataFrame.groupby()` returns a *DataFrameGroupBy* object:
```{python}
import pandas as pd
df = pd.DataFrame({'Animal' : ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed' : [380., 370., 24., 26.]})
print(df)
grps = df.groupby(['Animal'])
print('The return value of groupby function: \n%s' % grps)
print(grps.mean())
```
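The difference can also be checked with explicit type tests. This is a minimal sketch that assumes `DataFrameGroupBy` can be imported from `pandas.core.groupby`; the data frame is the same toy example as in the chunk before it.

```python
import pandas as pd
from pandas.core.groupby import DataFrameGroupBy

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

grps = df.groupby('Animal')

# groupby() on its own does not return a DataFrame ...
is_groupby = isinstance(grps, DataFrameGroupBy)
is_frame = isinstance(grps, pd.DataFrame)

# ... only an aggregation turns the groups back into a Series or DataFrame.
sizes = grps.size()  # rows per group, roughly what dplyr's count() reports
means = grps.mean()  # mean of each numeric column per group

print(is_groupby, is_frame)
print(sizes)
print(means)
```

In dplyr, the grouped result can be passed straight to `head()` or printed like any other data frame. In pandas, an aggregation such as `size()` or `mean()` is needed before a Series or DataFrame comes back.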