Merge pull request #38 from krlmlr/feature/vignette
- Include vignette (#38).
Showing 2 changed files with 164 additions and 1 deletion.
@@ -0,0 +1,157 @@
---
title: "Data frames"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data frames}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(magrittr)
library(tibble)
```

## Creating

`data_frame()` is a nice way to create data frames. It encapsulates best practices for data frames:

* It never changes an input's type (i.e., no more `stringsAsFactors = FALSE`!).

    ```{r}
    data.frame(x = letters) %>% sapply(class)
    data_frame(x = letters) %>% sapply(class)
    ```

    This makes it easier to use with list-columns:

    ```{r}
    data_frame(x = 1:3, y = list(1:5, 1:10, 1:20))
    ```

    List-columns are most commonly created by `do()` in dplyr, but they can be
    useful to create by hand.

* It never adjusts the names of variables:

    ```{r}
    data.frame(`crazy name` = 1) %>% names()
    data_frame(`crazy name` = 1) %>% names()
    ```

* It evaluates its arguments lazily and sequentially:

    ```{r}
    data_frame(x = 1:5, y = x ^ 2)
    ```

* It adds the `tbl_df` class to the output so that if you accidentally print a large
  data frame you only get the first few rows.

    ```{r}
    data_frame(x = 1:5) %>% class()
    ```

* It changes the behaviour of `[` to always return the same type of object:
  subsetting using `[` always returns a `tbl_df` object; subsetting using
  `[[` always returns a column.

    You should be aware of one case where subsetting a `tbl_df` object
    will produce a different result than subsetting a `data.frame` object:

    ```{r}
    df <- data.frame(a = 1:2, b = 1:2)
    str(df[, "a"])

    tbldf <- tbl_df(df)
    str(tbldf[, "a"])
    ```

* It never uses `row.names()`. The whole point of tidy data is to
  store variables in a consistent way, so it never stores a variable as
  a special attribute.

* It only recycles vectors of length 1. This is because recycling vectors of greater lengths
  is a frequent source of bugs.
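
The recycling rule in the last point can be illustrated with a short sketch. It assumes the `data_frame()` function from the tibble version this vignette accompanies. The names `ok` and `res` are illustrative only, and `tryCatch()` is used here so that the expected error does not stop the chunk.

```{r}
library(tibble)

# A length-1 vector is recycled to match the other columns.
ok <- data_frame(x = 1:4, y = 1)
ok

# A length-2 vector is not recycled, even though 2 divides 4 evenly.
# tryCatch() turns the error into a marker value so the chunk keeps running.
res <- tryCatch(
  data_frame(x = 1:4, y = 1:2),
  error = function(e) "error"
)
res

# Base R recycles silently in the same situation.
base_df <- data.frame(x = 1:4, y = 1:2)
base_df
```

The base R call succeeds without complaint, which is the kind of silent recycling the stricter rule is meant to rule out.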

## Coercion

To complement `data_frame()`, tibble provides `as_data_frame()` to coerce lists into data frames. It does two things:

* It checks that the input list is valid for a data frame, i.e. that each element
  is named, is a 1d atomic vector or list, and all elements have the same
  length.

* It sets the class and attributes of the list to make it behave like a data frame.
  This modification does not require a deep copy of the input list, so it's
  very fast.
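
The validity check in the first point can be sketched as follows. This is a minimal illustration rather than a description of the internals: the lists `valid` and `invalid` are made up for the example, and only the equal-length requirement is exercised.

```{r}
library(tibble)

# A named list whose elements all have the same length is accepted.
valid <- list(x = 1:3, y = c("a", "b", "c"))
coerced <- as_data_frame(valid)
coerced

# Elements of unequal length are rejected. tryCatch() captures the
# error message so the chunk keeps running.
invalid <- list(x = 1:3, y = 1:2)
msg <- tryCatch(
  as_data_frame(invalid),
  error = function(e) conditionMessage(e)
)
msg
```

The exact wording of the error message may differ between tibble versions, so the sketch only relies on an error being raised.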

This is much simpler than `as.data.frame()`. It's hard to explain precisely what `as.data.frame()` does, but it's similar to `do.call(cbind, lapply(x, data.frame))`, i.e. it coerces each component to a data frame and then `cbind()`s them all together. Consequently `as_data_frame()` is much faster than `as.data.frame()`:

```{r}
l2 <- replicate(26, sample(100), simplify = FALSE)
names(l2) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l2),
  as.data.frame(l2)
)
```

The speed of `as.data.frame()` is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.

## tbl_dfs vs data.frames

There are three key differences between tbl_dfs and data.frames:

* When you print a tbl_df, it only shows the first ten rows and all the
  columns that fit on one screen. It also prints an abbreviated description
  of the column type:

    ```{r}
    data_frame(x = 1:1000)
    ```

    You can control the default appearance with options:

    * `options(tibble.print_max = n, tibble.print_min = m)`: if there are more
      than `n` rows, print only the first `m` rows. Use
      `options(tibble.print_max = Inf)` to always show all rows.

    * `options(tibble.width = Inf)` will always print all columns, regardless
      of the width of the screen.

* When you subset a tbl\_df with `[`, it always returns another tbl\_df.
  Contrast this with a data frame: sometimes `[` returns a data frame and
  sometimes it just returns a single column:

    ```{r}
    df1 <- data.frame(x = 1:3, y = 3:1)
    class(df1[, 1:2])
    class(df1[, 1])

    df2 <- data_frame(x = 1:3, y = 3:1)
    class(df2[, 1:2])
    class(df2[, 1])
    ```

    To extract a single column, use `[[` or `$`:

    ```{r}
    class(df2[[1]])
    class(df2$x)
    ```

* When you extract a variable with `$`, tbl\_dfs never do partial
  matching. They'll throw an error if the column doesn't exist:

    ```{r, error = TRUE}
    df <- data.frame(abc = 1)
    df$a

    df2 <- data_frame(abc = 1)
    df2$a
    ```
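
The printing options mentioned in the first point can also be changed temporarily and then restored. The following is a minimal sketch that uses only the `tibble.print_max` and `tibble.print_min` options named above. The values 5 and 3 and the names `old` and `big` are arbitrary choices for illustration.

```{r}
library(tibble)

# options() returns the previous values, so they can be restored later.
old <- options(tibble.print_max = 5L, tibble.print_min = 3L)

# 1000 rows is more than print_max, so only the first print_min rows print.
big <- data_frame(x = 1:1000)
big

# Restore the previous settings.
options(old)
```

Saving and restoring the old values keeps the change local, which is useful when a script should not alter a user's session-wide printing defaults.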