Merge pull request #38 from krlmlr/feature/vignette

- Include vignette (#38).
tidyverse · Mar 10, 2016 · 65f6d7d
2 parents ef8b090 + fd5fa0f
commit 65f6d7d

Showing 2 changed files with 164 additions and 1 deletion.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -15,8 +15,14 @@ URL: https://github.com/krlmlr/tibble
 BugReports: https://github.com/krlmlr/tibble/issues
 Depends: R (>= 3.1.2)
 Imports: methods, assertthat, utils, lazyeval (>= 0.1.10), Rcpp
-Suggests: testthat, knitr, Lahman (>= 3.0.1)
+Suggests: testthat,
+    knitr,
+    rmarkdown,
+    Lahman (>= 3.0.1),
+    magrittr,
+    microbenchmark
 LazyData: yes
 License: MIT + file LICENSE
 RoxygenNote: 5.0.1
 LinkingTo: Rcpp
+VignetteBuilder: knitr
diff --git a/vignettes/data_frames.Rmd b/vignettes/data_frames.Rmd
@@ -0,0 +1,157 @@
+---
+title: "Data frames"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Data frames}
+  %\VignetteEngine{knitr::rmarkdown}
+  \usepackage[utf8]{inputenc}
+---
+
+```{r, echo = FALSE, message = FALSE}
+knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
+options(tibble.print_min = 4L, tibble.print_max = 4L)
+library(magrittr)
+library(tibble)
+```
+
+## Creating
+
+`data_frame()` is a nice way to create data frames. It encapsulates best practices for data frames:
+
+  * It never changes an input's type (i.e., no more `stringsAsFactors = FALSE`!).
+
+    ```{r}
+    data.frame(x = letters) %>% sapply(class)
+    data_frame(x = letters) %>% sapply(class)
+    ```
+    
+    This makes it easier to use with list-columns:
+    
+    ```{r}
+    data_frame(x = 1:3, y = list(1:5, 1:10, 1:20))
+    ```
+    
+    List-columns are most commonly created by `do()`, but they can be useful to
+    create by hand.
+      
+  * It never adjusts the names of variables:
+  
+    ```{r}
+    data.frame(`crazy name` = 1) %>% names()
+    data_frame(`crazy name` = 1) %>% names()
+    ```
+
+  * It evaluates its arguments lazily and sequentially:
+  
+    ```{r}
+    data_frame(x = 1:5, y = x ^ 2)
+    ```
+
+  * It adds the `tbl_df` class to the output so that, if you accidentally print a large 
+    data frame, you only get the first few rows.
+    
+    ```{r}
+    data_frame(x = 1:5) %>% class()
+    ```
+
+  * It changes the behaviour of `[` to always return the same type of object:
+    subsetting using `[` always returns a `tbl_df()` object; subsetting using 
+    `[[` always returns a column.
+    
+    You should be aware of one case where subsetting a `tbl_df()` object  
+    will produce a different result than a `data.frame()` object:
+  
+    ```{r}
+    df <- data.frame(a = 1:2, b = 1:2)
+    str(df[, "a"])
+    
+    tbldf <- tbl_df(df)
+    str(tbldf[, "a"])
+    ```
+    
+  * It never uses `row.names()`. The whole point of tidy data is to 
+    store variables in a consistent way, so it never stores a variable as 
+    a special attribute.
+  
+  * It only recycles vectors of length 1. This is because recycling vectors of greater lengths 
+    is a frequent source of bugs.
+
+## Coercion
+
+To complement `data_frame()`, tibble provides `as_data_frame()` to coerce lists into data frames. It does two things:
+
+* It checks that the input list is valid for a data frame, i.e. that each element
+  is named, is a 1d atomic vector or list, and all elements have the same 
+  length.
+  
+* It sets the class and attributes of the list to make it behave like a data frame.
+  This modification does not require a deep copy of the input list, so it's
+  very fast.
+  
+This is much simpler than `as.data.frame()`. It's hard to explain precisely what `as.data.frame()` does, but it's similar to `do.call(cbind, lapply(x, data.frame))`, i.e. it coerces each component to a data frame and then `cbind()`s them all together. Consequently, `as_data_frame()` is much faster than `as.data.frame()`:
+
+```{r}
+l2 <- replicate(26, sample(100), simplify = FALSE)
+names(l2) <- letters
+microbenchmark::microbenchmark(
+  as_data_frame(l2),
+  as.data.frame(l2)
+)
+```
+
+The speed of `as.data.frame()` is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.
+
+## tbl_dfs vs data.frames
+
+There are three key differences between tbl_dfs and data.frames:
+
+*   When you print a tbl_df, it only shows the first ten rows and all the
+    columns that fit on one screen. It also prints an abbreviated description
+    of the column type:
+
+    ```{r}
+    data_frame(x = 1:1000)
+    ```
+    
+    You can control the default appearance with options:
+    
+    * `options(tibble.print_max = n, tibble.print_min = m)`: if there are more
+      than `n` rows, print only the first `m` rows. Use
+      `options(tibble.print_max = Inf)` to always show all rows.
+    
+    * `options(tibble.width = Inf)` will always print all columns, regardless
+       of the width of the screen.
+
+    
+*   When you subset a tbl\_df with `[`, it always returns another tbl\_df. 
+    Contrast this with a data frame: sometimes `[` returns a data frame and
+    sometimes it just returns a single column:
+    
+    ```{r}
+    df1 <- data.frame(x = 1:3, y = 3:1)
+    class(df1[, 1:2])
+    class(df1[, 1])
+    
+    df2 <- data_frame(x = 1:3, y = 3:1)
+    class(df2[, 1:2])
+    class(df2[, 1])
+    ```
+    
+    To extract a single column, use `[[` or `$`:
+    
+    ```{r}
+    class(df2[[1]])
+    class(df2$x)
+    ```
+
+*   When you extract a variable with `$`, tbl\_dfs never do partial 
+    matching. They'll throw an error if the column doesn't exist:
+    
+    ```{r, error = TRUE}
+    df <- data.frame(abc = 1)
+    df$a
+    
+    df2 <- data_frame(abc = 1)
+    df2$a
+    ```
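
The two steps that the vignette's Coercion section attributes to `as_data_frame()` (validate the list, then set its class and attributes without copying the columns) can be sketched in base R. This is an illustrative approximation under stated assumptions, not tibble's implementation: `as_df_sketch()` is a hypothetical name, the error messages are made up, and the class vector `c("tbl_df", "tbl", "data.frame")` is assumed from the dplyr convention.

```r
# Hypothetical helper: a base-R sketch of list-to-data-frame coercion.
# Not the tibble implementation; names, checks, and messages are assumptions.
as_df_sketch <- function(x) {
  stopifnot(is.list(x))

  # Step 1: validate. Every element must be named ...
  nms <- names(x)
  if (is.null(nms) || anyNA(nms) || any(nms == "")) {
    stop("all elements must be named", call. = FALSE)
  }

  # ... must be a 1d atomic vector or list ...
  is_1d <- vapply(
    x,
    function(col) (is.atomic(col) || is.list(col)) && is.null(dim(col)),
    logical(1)
  )
  if (!all(is_1d)) {
    stop("each element must be a 1d atomic vector or list", call. = FALSE)
  }

  # ... and all elements must have the same length.
  n <- unique(lengths(x))
  if (length(n) != 1L) {
    stop("all elements must have the same length", call. = FALSE)
  }

  # Step 2: set class and row names on the list itself.
  # Only attributes are added; the column vectors are not deep-copied.
  structure(
    x,
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = .set_row_names(n)
  )
}

l <- list(a = 1:3, b = letters[1:3])
df <- as_df_sketch(l)
dim(df)
```

Because the sketch only attaches attributes, its cost does not grow with the amount of data in each column, which is consistent with the vignette's explanation of why this style of coercion is fast.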