Merge pull request #38 from krlmlr/feature/vignette
- Include vignette (#38).
krlmlr committed Mar 10, 2016
2 parents ef8b090 + fd5fa0f commit 65f6d7d
Showing 2 changed files with 164 additions and 1 deletion.
8 changes: 7 additions & 1 deletion DESCRIPTION
Expand Up @@ -15,8 +15,14 @@ URL: https://github.com/krlmlr/tibble
BugReports: https://github.com/krlmlr/tibble/issues
Depends: R (>= 3.1.2)
Imports: methods, assertthat, utils, lazyeval (>= 0.1.10), Rcpp
-Suggests: testthat, knitr, Lahman (>= 3.0.1)
+Suggests: testthat,
+    knitr,
+    rmarkdown,
+    Lahman (>= 3.0.1),
+    magrittr,
+    microbenchmark
LazyData: yes
License: MIT + file LICENSE
RoxygenNote: 5.0.1
LinkingTo: Rcpp
+VignetteBuilder: knitr
157 changes: 157 additions & 0 deletions vignettes/data_frames.Rmd
@@ -0,0 +1,157 @@
---
title: "Data frames"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data frames}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(magrittr)
library(tibble)
```

## Creating

`data_frame()` is a nice way to create data frames. It encapsulates best practices for data frames:

* It never changes an input's type (i.e., no more `stringsAsFactors = FALSE`!).

    ```{r}
    data.frame(x = letters) %>% sapply(class)
    data_frame(x = letters) %>% sapply(class)
    ```

    This makes it easier to use with list-columns:

    ```{r}
    data_frame(x = 1:3, y = list(1:5, 1:10, 1:20))
    ```

    List-columns are most commonly created by dplyr's `do()`, but they can be
    useful to create by hand.

* It never adjusts the names of variables:

    ```{r}
    data.frame(`crazy name` = 1) %>% names()
    data_frame(`crazy name` = 1) %>% names()
    ```

* It evaluates its arguments lazily and sequentially:

    ```{r}
    data_frame(x = 1:5, y = x ^ 2)
    ```

* It adds the `tbl_df` class to the output so that if you accidentally print a large
  data frame you only get the first few rows.

    ```{r}
    data_frame(x = 1:5) %>% class()
    ```

* It changes the behaviour of `[` to always return the same type of object:
  subsetting using `[` always returns a `tbl_df` object; subsetting using
  `[[` always returns a column.

    You should be aware of one case where subsetting a `tbl_df` object
    will produce a different result than a `data.frame` object:

    ```{r}
    df <- data.frame(a = 1:2, b = 1:2)
    str(df[, "a"])
    tbldf <- tbl_df(df)
    str(tbldf[, "a"])
    ```

* It never uses `row.names()`. The whole point of tidy data is to
  store variables in a consistent way, so it never stores a variable as
  a special attribute.

* It only recycles vectors of length 1. This is because recycling vectors of greater lengths
  is a frequent source of bugs.
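
To see why recycling longer vectors is risky, here is a small sketch that uses only base R's `data.frame()` (the variable names are illustrative, and nothing here is tibble's own code). Base R silently repeats a length-2 vector to fill four rows, which can hide a mistake in the input. Under the rule stated above, `data_frame()` would be expected to reject the same input.

```r
# Base R recycles any vector whose length evenly divides the number of rows.
df <- data.frame(x = 1:4, y = 1:2)
df$y
# The length-2 input was repeated without any warning: 1 2 1 2

# A length that does not divide the row count is an error even in base R.
res <- try(data.frame(x = 1:4, y = 1:3), silent = TRUE)
inherits(res, "try-error")
```

Restricting recycling to length-1 vectors keeps the convenient case (a constant column) and turns the ambiguous ones into errors.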

## Coercion

To complement `data_frame()`, tibble provides `as_data_frame()` to coerce lists into data frames. It does two things:

* It checks that the input list is valid for a data frame, i.e. that each element
  is named, is a 1d atomic vector or list, and all elements have the same
  length.

* It sets the class and attributes of the list to make it behave like a data frame.
  This modification does not require a deep copy of the input list, so it's
  very fast.

This is much simpler than `as.data.frame()`. It's hard to explain precisely what `as.data.frame()` does, but it's similar to `do.call(cbind, lapply(x, data.frame))`, i.e. it coerces each component to a data frame and then `cbind()`s them all together. Consequently, `as_data_frame()` is much faster than `as.data.frame()`:

```{r}
l2 <- replicate(26, sample(100), simplify = FALSE)
names(l2) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l2),
  as.data.frame(l2)
)
```

The speed of `as.data.frame()` is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.
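
The two steps described for `as_data_frame()` can be sketched in base R. This is only an illustration under stated assumptions: `list_to_df()` is a hypothetical helper, it sets the plain `data.frame` class rather than tibble's classes, and it is not the package's actual implementation.

```r
# Hypothetical helper: a base R sketch of the two steps, not tibble's code.
list_to_df <- function(x) {
  # Step 1: check that the list is a valid set of columns.
  stopifnot(is.list(x), length(x) > 0L)
  nms <- names(x)
  stopifnot(!is.null(nms), !anyNA(nms), all(nms != ""))
  is_1d <- vapply(
    x,
    function(col) (is.atomic(col) || is.list(col)) && is.null(dim(col)),
    logical(1)
  )
  stopifnot(all(is_1d))
  n <- unique(vapply(x, length, integer(1)))
  stopifnot(length(n) == 1L)

  # Step 2: set the attributes; the column vectors are left untouched.
  attr(x, "row.names") <- .set_row_names(n)
  class(x) <- "data.frame"
  x
}

df <- list_to_df(list(a = 1:3, b = c("x", "y", "z")))
class(df)
dim(df)
```

Because the helper only validates and then attaches attributes, the character column `b` stays a character vector, and no per-column conversion or `cbind()` call is involved.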

## tbl_dfs vs data.frames

There are three key differences between tbl_dfs and data.frames:

* When you print a tbl_df, it shows only the first ten rows by default (this
  vignette lowers the limit to four via `tibble.print_min`) and all the
  columns that fit on one screen. It also prints an abbreviated description
  of the column type:

    ```{r}
    data_frame(x = 1:1000)
    ```

    You can control the default appearance with options:

    * `options(tibble.print_max = n, tibble.print_min = m)`: if there are more
      than `n` rows, print only the first `m` rows. Use
      `options(tibble.print_max = Inf)` to always show all rows.

    * `options(tibble.width = Inf)` will always print all columns, regardless
      of the width of the screen.

* When you subset a tbl\_df with `[`, it always returns another tbl\_df.
  Contrast this with a data frame: sometimes `[` returns a data frame and
  sometimes it just returns a single column:

    ```{r}
    df1 <- data.frame(x = 1:3, y = 3:1)
    class(df1[, 1:2])
    class(df1[, 1])
    df2 <- data_frame(x = 1:3, y = 3:1)
    class(df2[, 1:2])
    class(df2[, 1])
    ```

    To extract a single column, use `[[` or `$`:

    ```{r}
    class(df2[[1]])
    class(df2$x)
    ```

* When you extract a variable with `$`, tbl\_dfs never do partial
  matching. They'll throw an error if the column doesn't exist:

    ```{r, error = TRUE}
    df <- data.frame(abc = 1)
    df$a
    df2 <- data_frame(abc = 1)
    df2$a
    ```
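
For comparison, a plain data frame can be made to behave more predictably with base R alone. The sketch below is an illustration with made-up column names and uses no tibble functions. It relies on two documented base R features: the `drop = FALSE` argument of `[`, and the exact matching that `[[` performs by default.

```r
df <- data.frame(abc = 1:3, xyz = 3:1)

# `[` with a single column drops to a vector unless drop = FALSE is given.
class(df[, 1])
class(df[, 1, drop = FALSE])

# `$` matches partially, so `a` silently resolves to the column `abc`.
partial <- df$a

# `[[` matches exactly by default and returns NULL for an unknown name.
exact <- df[["a"]]
is.null(exact)
```

With a plain data frame these safeguards have to be remembered on every call, whereas the tbl\_df behaviour described above applies them without extra arguments.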
