Include vignette #38

Merged
merged 5 commits into from
Mar 10, 2016
8 changes: 7 additions & 1 deletion DESCRIPTION
@@ -15,8 +15,14 @@ URL: https://github.com/krlmlr/tibble
 BugReports: https://github.com/krlmlr/tibble/issues
 Depends: R (>= 3.1.2)
 Imports: methods, assertthat, utils, lazyeval (>= 0.1.10), Rcpp
-Suggests: testthat, knitr, Lahman (>= 3.0.1)
+Suggests: testthat,
+    knitr,
+    rmarkdown,
+    Lahman (>= 3.0.1),
+    magrittr,
+    microbenchmark
 LazyData: yes
 License: MIT + file LICENSE
 RoxygenNote: 5.0.1
 LinkingTo: Rcpp
+VignetteBuilder: knitr
157 changes: 157 additions & 0 deletions vignettes/data_frames.Rmd
@@ -0,0 +1,157 @@
---
title: "Data frames"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data frames}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(magrittr)
library(tibble)
```

## Creating

`data_frame()` is a nice way to create data frames. It encapsulates best practices for data frames:

* It never changes an input's type (i.e., no more `stringsAsFactors = FALSE`!).

```{r}
data.frame(x = letters) %>% sapply(class)
data_frame(x = letters) %>% sapply(class)
```

This makes it easier to use with list-columns:

```{r}
data_frame(x = 1:3, y = list(1:5, 1:10, 1:20))
```

List-columns are most commonly created by `dplyr::do()`, but it can also be
useful to create them by hand.

* It never adjusts the names of variables:

```{r}
data.frame(`crazy name` = 1) %>% names()
data_frame(`crazy name` = 1) %>% names()
```

* It evaluates its arguments lazily and sequentially:

```{r}
data_frame(x = 1:5, y = x ^ 2)
```

* It adds the `tbl_df` class to the output, so that if you accidentally print a large
  data frame you only get the first few rows.

```{r}
data_frame(x = 1:5) %>% class()
```

* It changes the behaviour of `[` to always return the same type of object:
  subsetting using `[` always returns a `tbl_df` object; subsetting using
  `[[` always returns a column.

  You should be aware of one case where the two differ: selecting a single
  column with `[` returns a one-column `tbl_df`, whereas a `data.frame`
  drops to a vector:

```{r}
df <- data.frame(a = 1:2, b = 1:2)
str(df[, "a"])

tbldf <- tbl_df(df)
str(tbldf[, "a"])
```

* It never uses `row.names()`. The whole point of tidy data is to
  store variables in a consistent way, so it never stores a variable as a
  special attribute.

* It only recycles vectors of length 1. This is because recycling vectors of greater lengths
is a frequent source of bugs.
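The recycling rule in the last point can be illustrated with a short sketch. It assumes the `data_frame()` described above is available from the loaded tibble package, and it wraps the failing call in `try()` so that the chunk runs to completion; the variable names are only for illustration.

```{r}
library(tibble)

# Base R silently recycles the length-2 vector to fill four rows.
recycled <- data.frame(x = 1:4, y = 1:2)
recycled$y

# data_frame() recycles a length-1 vector ...
scalar <- data_frame(x = 1:4, y = 1)
nrow(scalar)

# ... but refuses to recycle a length-2 vector and signals an error instead.
res <- try(data_frame(x = 1:4, y = 1:2), silent = TRUE)
inherits(res, "try-error")
```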

## Coercion

To complement `data_frame()`, tibble provides `as_data_frame()` to coerce lists into data frames. It does two things:

* It checks that the input list is valid for a data frame, i.e. that each element
is named, is a 1d atomic vector or list, and all elements have the same
length.

* It sets the class and attributes of the list to make it behave like a data frame.
This modification does not require a deep copy of the input list, so it's
very fast.
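The two steps above can be sketched in base R. This is only an illustration of the idea, not the package's actual implementation: `list_to_tbl()` is a hypothetical helper, and the sketch assumes a non-empty input list.

```{r}
# Hypothetical helper for illustration; not tibble's real code.
list_to_tbl <- function(x) {
  stopifnot(is.list(x), length(x) > 0)

  # Step 1: check that the list is valid for a data frame.
  nms <- names(x)
  if (is.null(nms) || anyNA(nms) || any(nms == "")) {
    stop("Every element must be named.", call. = FALSE)
  }
  is_1d <- vapply(
    x,
    function(col) (is.atomic(col) || is.list(col)) && is.null(dim(col)),
    logical(1)
  )
  if (!all(is_1d)) {
    stop("Every element must be a 1d atomic vector or list.", call. = FALSE)
  }
  n <- unique(vapply(x, length, integer(1)))
  if (length(n) != 1L) {
    stop("All elements must have the same length.", call. = FALSE)
  }

  # Step 2: set the attributes and class; the columns themselves are
  # not rebuilt, which is why this step is cheap.
  attr(x, "row.names") <- .set_row_names(n)
  class(x) <- c("tbl_df", "tbl", "data.frame")
  x
}

df <- list_to_tbl(list(a = 1:3, b = letters[1:3]))
is.data.frame(df)
dim(df)
```

A list with elements of unequal length, such as `list(a = 1:3, b = 1:2)`, fails the check in step 1 with an error.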

This is much simpler than `as.data.frame()`. It's hard to explain precisely what `as.data.frame()` does, but it's similar to `do.call(cbind, lapply(x, data.frame))`, i.e. it coerces each component to a data frame and then `cbind()`s them all together. Consequently, `as_data_frame()` is much faster than `as.data.frame()`:

```{r}
l2 <- replicate(26, sample(100), simplify = FALSE)
names(l2) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l2),
  as.data.frame(l2)
)
```

The speed of `as.data.frame()` is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.

## tbl_dfs vs data.frames

There are three key differences between tbl_dfs and data.frames:

* When you print a tbl_df, it only shows the first ten rows and all the
columns that fit on one screen. It also prints an abbreviated description
of the column type:

```{r}
data_frame(x = 1:1000)
```

You can control the default appearance with options:

* `options(tibble.print_max = n, tibble.print_min = m)`: if there are more
  than `n` rows, print only the first `m` rows. Use
  `options(tibble.print_max = Inf)` to always show all rows.

* `options(tibble.width = Inf)` will always print all columns, regardless
of the width of the screen.


* When you subset a tbl\_df with `[`, it always returns another tbl\_df.
Contrast this with a data frame: sometimes `[` returns a data frame and
sometimes it just returns a single column:

```{r}
df1 <- data.frame(x = 1:3, y = 3:1)
class(df1[, 1:2])
class(df1[, 1])

df2 <- data_frame(x = 1:3, y = 3:1)
class(df2[, 1:2])
class(df2[, 1])
```

To extract a single column, use `[[` or `$`:

```{r}
class(df2[[1]])
class(df2$x)
```

* When you extract a variable with `$`, tbl\_dfs never do partial
matching. They'll throw an error if the column doesn't exist:

```{r, error = TRUE}
df <- data.frame(abc = 1)
df$a

df2 <- data_frame(abc = 1)
df2$a
```
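Returning to the printing options described under the first difference, the following sketch shows one way to change them temporarily. The values `20L` and `5L` are arbitrary choices for illustration, and the sketch assumes `data_frame()` from the loaded tibble package.

```{r}
library(tibble)

# options() returns the previous values, so they can be restored later.
old <- options(tibble.print_max = 20L, tibble.print_min = 5L)

# 1000 rows exceeds tibble.print_max, so only tibble.print_min rows are shown.
data_frame(x = 1:1000)

# Restore the previous settings.
options(old)
```

Saving and restoring the old values keeps the change local to this chunk, so later output in the same session is unaffected.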