From b497da2435e5976a16a02a5758da4587c707e36c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kirill=20M=C3=BCller?= Date: Thu, 10 Mar 2016 20:08:57 +0100 Subject: [PATCH 1/4] copy part of dplyr's vignette --- DESCRIPTION | 4 +- vignettes/data_frames.Rmd | 156 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 159 insertions(+), 1 deletion(-) create mode 100644 vignettes/data_frames.Rmd diff --git a/DESCRIPTION b/DESCRIPTION index 027627ebe..3392a6d98 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -15,8 +15,10 @@ URL: https://github.com/krlmlr/tibble BugReports: https://github.com/krlmlr/tibble/issues Depends: R (>= 3.1.2) Imports: methods, assertthat, utils, lazyeval (>= 0.1.10), Rcpp -Suggests: testthat, knitr, Lahman (>= 3.0.1) +Suggests: testthat, knitr, Lahman (>= 3.0.1), + rmarkdown LazyData: yes License: MIT + file LICENSE RoxygenNote: 5.0.1 LinkingTo: Rcpp +VignetteBuilder: knitr diff --git a/vignettes/data_frames.Rmd b/vignettes/data_frames.Rmd new file mode 100644 index 000000000..c57100d15 --- /dev/null +++ b/vignettes/data_frames.Rmd @@ -0,0 +1,156 @@ +--- +title: "Data frames" +date: "`r Sys.Date()`" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Data frames} + %\VignetteEngine{knitr::rmarkdown} + \usepackage[utf8]{inputenc} +--- + +```{r, echo = FALSE, message = FALSE} +knitr::opts_chunk$set(collapse = T, comment = "#>") +options(tibble.print_min = 4L, tibble.print_max = 4L) +library(dplyr) +``` + +## Creating + +`data_frame()` is a nice way to create data frames. It encapsulates best practices for data frames: + + * It never changes an input's type (i.e., no more `stringsAsFactors = FALSE`!). + + ```{r} + data.frame(x = letters) %>% sapply(class) + data_frame(x = letters) %>% sapply(class) + ``` + + This makes it easier to use with list-columns: + + ```{r} + data_frame(x = 1:3, y = list(1:5, 1:10, 1:20)) + ``` + + List-columns are most commonly created by `do()`, but they can be useful to + create by hand. 
+ + * It never adjusts the names of variables: + + ```{r} + data.frame(`crazy name` = 1) %>% names() + data_frame(`crazy name` = 1) %>% names() + ``` + + * It evaluates its arguments lazily and sequentially: + + ```{r} + data_frame(x = 1:5, y = x ^ 2) + ``` + + * It adds the `tbl_df()` class to the output so that if you accidentally print a large + data frame you only get the first few rows. + + ```{r} + data_frame(x = 1:5) %>% class() + ``` + + * It changes the behaviour of `[` to always return the same type of object: + subsetting using `[` always returns a `tbl_df()` object; subsetting using + `[[` always returns a column. + + You should be aware of one case where subsetting a `tbl_df()` object + will produce a different result than a `data.frame()` object: + + ```{r} + df <- data.frame(a = 1:2, b = 1:2) + str(df[, "a"]) + + tbldf <- tbl_df(df) + str(tbldf[, "a"]) + ``` + + * It never uses `row.names()`. The whole point of tidy data is to + store variables in a consistent way. So it never stores a variable as + a special attribute. + + * It only recycles vectors of length 1. This is because recycling vectors of greater lengths + is a frequent source of bugs. + +## Coercion + +To complement `data_frame()`, dplyr provides `as_data_frame()` to coerce lists into data frames. It does two things: + +* It checks that the input list is valid for a data frame, i.e. that each element + is named, is a 1d atomic vector or list, and all elements have the same + length. + +* It sets the class and attributes of the list to make it behave like a data frame. + This modification does not require a deep copy of the input list, so it's + very fast. + +This is much simpler than `as.data.frame()`. It's hard to explain precisely what `as.data.frame()` does, but it's similar to `do.call(cbind, lapply(x, data.frame))` - i.e. it coerces each component to a data frame and then `cbind()`s them all together. 
Consequently `as_data_frame()` is much faster than `as.data.frame()`: + +```{r} +l2 <- replicate(26, sample(100), simplify = FALSE) +names(l2) <- letters +microbenchmark::microbenchmark( + as_data_frame(l2), + as.data.frame(l2) +) +``` + +The speed of `as.data.frame()` is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame. + +## tbl_dfs vs data.frames + +There are three key differences between tbl_dfs and data.frames: + +* When you print a tbl_df, it only shows the first ten rows and all the + columns that fit on one screen. It also prints an abbreviated description + of the column type: + + ```{r} + data_frame(x = 1:1000) + ``` + + You can control the default appearance with options: + + * `options(tibble.print_max = n, tibble.print_min = m)`: if there are more than `n` + rows, print only the first `m` rows. Use `options(tibble.print_max = Inf)` to always + show all rows. + + * `options(tibble.width = Inf)` will always print all columns, regardless + of the width of the screen. + + +* When you subset a tbl\_df with `[`, it always returns another tbl\_df. + Contrast this with a data frame: sometimes `[` returns a data frame and + sometimes it just returns a single column: + + ```{r} + df1 <- data.frame(x = 1:3, y = 3:1) + class(df1[, 1:2]) + class(df1[, 1]) + + df2 <- data_frame(x = 1:3, y = 3:1) + class(df2[, 1:2]) + class(df2[, 1]) + ``` + + To extract a single column, use `[[` or `$`: + + ```{r} + class(df2[[1]]) + class(df2$x) + ``` + +* When you extract a variable with `$`, tbl\_dfs never do partial + matching. 
They'll throw an error if the column doesn't exist: + + ```{r, error = TRUE} + df <- data.frame(abc = 1) + df$a + + df2 <- data_frame(abc = 1) + df2$a + ``` From f0681f77b49555a909647aa2f990fd2962a1a327 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kirill=20M=C3=BCller?= Date: Thu, 10 Mar 2016 20:12:08 +0100 Subject: [PATCH 2/4] tibble --- vignettes/data_frames.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/data_frames.Rmd b/vignettes/data_frames.Rmd index c57100d15..24c31021b 100644 --- a/vignettes/data_frames.Rmd +++ b/vignettes/data_frames.Rmd @@ -11,7 +11,7 @@ vignette: > ```{r, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) -library(dplyr) +library(tibble) ``` ## Creating From e2e9111589335fab35b17cfa24d63fe52e6fd8a9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kirill=20M=C3=BCller?= Date: Thu, 10 Mar 2016 20:17:33 +0100 Subject: [PATCH 3/4] need magrittr --- DESCRIPTION | 7 +++++-- vignettes/data_frames.Rmd | 1 + 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index 3392a6d98..62fdf94a2 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -15,8 +15,11 @@ URL: https://github.com/krlmlr/tibble BugReports: https://github.com/krlmlr/tibble/issues Depends: R (>= 3.1.2) Imports: methods, assertthat, utils, lazyeval (>= 0.1.10), Rcpp -Suggests: testthat, knitr, Lahman (>= 3.0.1), - rmarkdown +Suggests: testthat, + knitr, + rmarkdown, + Lahman (>= 3.0.1), + magrittr LazyData: yes License: MIT + file LICENSE RoxygenNote: 5.0.1 diff --git a/vignettes/data_frames.Rmd b/vignettes/data_frames.Rmd index 24c31021b..116cac422 100644 --- a/vignettes/data_frames.Rmd +++ b/vignettes/data_frames.Rmd @@ -11,6 +11,7 @@ vignette: > ```{r, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) +library(magrittr) library(tibble) ``` From 
fd5fa0fa54c0002faf7f834249f5c95fff0162dd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kirill=20M=C3=BCller?= Date: Thu, 10 Mar 2016 20:44:27 +0100 Subject: [PATCH 4/4] microbenchmark --- DESCRIPTION | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index 62fdf94a2..d10ff98e2 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -19,7 +19,8 @@ Suggests: testthat, knitr, rmarkdown, Lahman (>= 3.0.1), - magrittr + magrittr, + microbenchmark LazyData: yes License: MIT + file LICENSE RoxygenNote: 5.0.1