
# Keep defaults short and sweet {#sec-defaults-short-and-sweet}

```{r}
#| include = FALSE
source("common.R")
```

```{r}
#| eval = FALSE,
#| include = FALSE
source("fun_def.R")
pkg_funs("base") |> keep(\(f) some(f$formals, is.null))
pkg_funs("base") |> keep(\(f) some(f$formals, \(arg) is_call(arg) && !is_call(arg, c("c", "getOption", "c", "if", "topenv", "parent.frame"))))


funs <- c(pkg_funs("base"), pkg_funs("stats"))
arg_length <- function(x) map_int(x$formals, ~ nchar(expr_text(.x)))
args <- map(funs, arg_length)
args_max <- map_dbl(args, ~ if (length(.x) == 0) 0 else max(.x))

funs[args_max > 50] %>% discard(~ grepl("as.data.frame", .x$name, fixed = TRUE))
```

## What's the pattern?

Default values should be short and sweet.
Avoid large or complex calculations in the default values, instead using `NULL` or a helper function when the default requires complex calculation.
This keeps the function specification focussed on the big picture (i.e. what are the arguments and are they required or not) rather than the details of the defaults.

## What are some examples?

It's common for functions to use `NULL` to mean that the argument is optional, but the computation of the default is non-trivial:

-   The default `labels` in `cut()` yields labels in the form `(a,b]`.
-   The default `pattern` in `dir()` means match all files.
-   The default `by` in `dplyr::left_join()` means join using the common variables between the two data frames (the so-called natural join).
-   The default `mapping` in `ggplot2::geom_point()` (and friends) means use the mapping from the overall plot.

In other cases, we encapsulate default values into a function:

-   readr functions use a family of functions including `readr::show_progress()`, `readr::should_show_types()` and `readr::should_read_lazy()` that make it easier for users to override various defaults.

It's also worth looking at a couple of counter examples that come from base R:

-   The default value for `by` in `seq()` is `((to - from)/(length.out - 1))`.

-   `reshape()` has a very long default argument: the `split` argument is one of two possible lists depending on the value of the `sep` argument:

    ```{r}
    #| eval = FALSE
    reshape <- function(
        ...,
        split = if (sep == "") {
          list(regexp = "[A-Za-z][0-9]", include = TRUE)
        } else {
          list(regexp = sep, include = FALSE, fixed = TRUE)
        }
    ) {}
    ```

-   `sample.int()` uses a complicated rule to determine whether or not to use a faster hash-based method that's only applicable in some circumstances: `useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e+07)`.

## How do I use it?

So what should you do if a default requires some complex calculation?
We have two recommended approaches: using `NULL` or creating a helper function.
I'll also show you two alternatives that we don't generally recommend, but that you'll see in a handful of places in the tidyverse and that can be useful in limited circumstances.

### `NULL` default

The simplest, and most common, way to indicate that an argument is optional but has a complex default is to use `NULL` as the default.
Then in the body of the function you perform the actual calculation only if the argument is `NULL`.
For example, if we were to use this approach in `sample.int()`, it might look something like this:

```{r}
sample.int <- function (n, size = n, replace = FALSE, prob = NULL, useHash = NULL)  {
  if (is.null(useHash)) {
    useHash <- n > 1e+07 && !replace && is.null(prob) && size <= n/2
  }
}
```

This pattern is made more elegant with the infix `%||%` operator which is built in to R 4.4.
If you need it in an older version of R you can import it from rlang or copy and paste it in to your `utils.R`:

```{r}
`%||%` <- function(x, y) if (is.null(x)) y else x

sample.int <- function (n, size = n, replace = FALSE, prob = NULL, useHash = NULL)  {
  # Parentheses are required: `%||%` binds more tightly than `>` and `&&`.
  useHash <- useHash %||% (n > 1e+07 && !replace && is.null(prob) && size <= n/2)
}
```
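
One thing to watch when the fallback is a compound expression: like every `%op%` operator, `%||%` binds more tightly than comparison and logical operators, so the fallback needs to be wrapped in parentheses. Here is a minimal sketch with made-up values (`use_hash` and `n` are plain variables used for illustration, not the arguments of a real function):

```{r}
`%||%` <- function(x, y) if (is.null(x)) y else x

use_hash <- TRUE  # pretend the user explicitly asked for hashing
n <- 100

# Parsed as (use_hash %||% n) > 1e7, i.e. TRUE > 1e7, so the user's
# choice is silently discarded.
use_hash %||% n > 1e7

# With parentheses the user's value wins and the comparison is only
# used as the fallback.
use_hash %||% (n > 1e7)
```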

`%||%` is particularly well suited to arguments where the default value is found through a cascading system of fallbacks.
For example, this code from `ggplot2::geom_bar()` finds the width by first looking at the data, then in the parameters, finally falling back to computing it from the resolution of the `x` variable:

```{r}
#| eval = FALSE
width <- data$width %||% params$width %||% (resolution(data$x, FALSE) * 0.9)
```
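
To see the cascade in action without ggplot2, here is a small self-contained sketch. `pick_width()` is a hypothetical stand-in, `data` and `params` are plain lists, and the final fallback is a simplified rule of my own rather than ggplot2's `resolution()`:

```{r}
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical helper: try data$width, then params$width, then compute
# a width from the smallest gap between distinct x values.
pick_width <- function(data, params) {
  data$width %||% params$width %||% (min(diff(sort(unique(data$x)))) * 0.9)
}

pick_width(list(x = 1:3, width = 0.5), list(width = 0.8)) # uses data$width
pick_width(list(x = 1:3), list(width = 0.8))              # uses params$width
pick_width(list(x = 1:3), list())                         # computed fallback
```

Because the right-hand side of `%||%` is only evaluated when the left-hand side is `NULL`, the computed fallback is never run if an earlier source supplies a value.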

Don't use `%||%` for more complex examples where the individual clauses can't fit on their own line.
For example in `reshape()`, I wouldn't write:

```{r}
#| eval: false
reshape <- function(..., sep = ".", split = NULL) {
  split <- split %||% if (sep == "") {
    list(regexp = "[A-Za-z][0-9]", include = TRUE)
  } else {
    list(regexp = sep, include = FALSE, fixed = TRUE)
  }  
  ...
}
```

I would instead use `is.null()` and assign `split` inside each branch:

```{r}
#| eval: false
reshape <- function(..., sep = ".", split = NULL) {
  if (is.null(split)) {
    if (sep == "") {
      split <- list(regexp = "[A-Za-z][0-9]", include = TRUE)
    } else {
      split <- list(regexp = sep, include = FALSE, fixed = TRUE)
    }
  }
  ...
}
```

Or alternatively you might pull the code out into a helper function:

```{r}
split_default <- function(sep = ".") {
  if (sep == "") {
    list(regexp = "[A-Za-z][0-9]", include = TRUE)
  } else {
    list(regexp = sep, include = FALSE, fixed = TRUE)
  }
}

reshape <- function(..., sep = ".", split = NULL) {
  split <- split %||% split_default(sep)
  ...
}
```

That makes it very clear exactly which other arguments the default for `split` depends on.

### Exported helper function

If you have created a helper function for your own use, you might consider using it as the default:

```{r}
reshape <- function(..., sep = ".", split = split_default(sep)) {
  ...
}
```

The problem with using an internal function as the default is that the user can't easily run this function to see what it does, making the default a bit magical (@sec-def-magical).
So we recommend that if you want to do this you export and document that function.
This is the main downside of this approach: you have to think carefully about the name of the function because it's user-facing.

A good example of this pattern is `readr::show_progress()`: it's used in every `read_*()` function in readr to determine whether or not a progress bar should be shown.
Because it has a relatively complex explanation, it's nice to be able to document it in its own file, rather than cluttering up file reading functions with incidental details.
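
As a rough sketch of what such a helper might look like in your own package: the function `mypkg_show_progress()` and the option `mypkg.show_progress` below are hypothetical names chosen for illustration, and the rule is deliberately simpler than whatever readr actually does.

```{r}
# Hypothetical exported helper: show progress only when the (made-up)
# option allows it and the session is interactive.
mypkg_show_progress <- function() {
  isTRUE(getOption("mypkg.show_progress", default = TRUE)) && interactive()
}

# The signature stays short; the rule lives, and is documented, in one place.
read_sketch <- function(path, progress = mypkg_show_progress()) {
  if (progress) message("Reading ", path)
  invisible(path)
}

# Users can override the default for a single call ...
read_sketch("data.csv", progress = FALSE)

# ... or for the whole session, via the option the helper consults.
old <- options(mypkg.show_progress = FALSE)
mypkg_show_progress()
options(old) # restore the previous setting
```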

### Alternatives

If the above techniques don't work for your case there are two other alternatives that we don't generally recommend but can be useful in limited situations.

::: {.callout-note collapse="true"}
#### Sentinel value

Sometimes you'd like to use the `NULL` approach defined above, but `NULL` already has a specific meaning that you want to preserve.
For example, this comes up in ggplot2 scales functions which allow you to set the `name` of the scale which is displayed on the axis or legend.
The default value should just preserve whatever existing label is present so that if you're providing a scale to customise (e.g.) the breaks or labels, you don't need to re-type the scale name.
However, `NULL` is also a meaningful value because it means eliminate the scale label altogether[^defaults-short-and-sweet-1].
For that reason the default value for `name` is `ggplot2::waiver()`, a ggplot2-specific convention that means "inherit from the existing value".

If you look at `ggplot2::waiver()` you'll see it's just a very lightweight S3 class[^defaults-short-and-sweet-2]:

```{r}
ggplot2::waiver
```

And then ggplot2 also provides the internal `is.waive()`[^defaults-short-and-sweet-3] function, which allows us to work with it in the same way we might work with a `NULL`:

```{r}
is.waive <- function(x) {
  inherits(x, "waiver")
}
```
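
Putting the pieces together, a function that uses a sentinel might look something like the sketch below. All of the names (`keep_existing()`, `is_keep_existing()`, `relabel()`) are hypothetical and only loosely modelled on the ggplot2 approach:

```{r}
# Hypothetical sentinel: an empty object with its own class.
keep_existing <- function() structure(list(), class = "sketch_keep_existing")
is_keep_existing <- function(x) inherits(x, "sketch_keep_existing")

relabel <- function(current, name = keep_existing()) {
  if (is_keep_existing(name)) {
    current # default: inherit the existing label
  } else {
    name    # anything else wins, including NULL (remove the label)
  }
}

relabel("Price")                # keeps the existing label
relabel("Price", name = "Cost") # replaces it
relabel("Price", name = NULL)   # removes it
```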

The primary downside of this technique is that it requires substantial infrastructure to set up, so it's only really worth it for very important functions or if you're going to use it in multiple places.
:::

[^defaults-short-and-sweet-1]: Unlike `name = ""` which doesn't show the label, but preserves the space where it would appear (sometimes useful for aligning multiple plots), `name = NULL` also eliminates the space normally allocated for the label.

[^defaults-short-and-sweet-2]: If I were to write this code today I'd use `ggplot2_waiver` as the class name.

[^defaults-short-and-sweet-3]: If I wrote this code today, I'd call it `is_waiver()`.

::: {.callout-warning collapse="true"}
#### No default

The final alternative is to condition on the absence of an argument using `missing()`. It works something like this:

```{r}
reshape <- function(..., sep = ".", split) {
  if (missing(split)) {
    split <- split_default(sep)
  }
  ...
}
```

I mention this technique because we used it in `purrr::reduce()` for the `.init` argument.
This argument is mostly optional:

```{r}
library(purrr)
reduce(letters[1:3], paste)
reduce(letters[1:2], paste)
reduce(letters[1], paste)
```

But it is required when `.x` (the first argument) is empty, and it's good practice to supply it when wrapping `reduce()` inside another function because it ensures that you get the right type of output for all inputs:

```{r}
#| error: true
reduce(letters[0], paste)
reduce(letters[0], paste, .init = "")
```

Why use this approach?
`NULL` is a potentially valid option for `.init`, so we can't use that approach.
And we only need it for a single function that's not terribly important, so creating a sentinel didn't seem worth it.
`.init` is "semi" required so this seemed to be the least worst solution to the problem.
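
To make the first point concrete, here is a minimal sketch showing that `missing()` can distinguish an argument that was not supplied from one that was explicitly set to `NULL`, which a `NULL` default cannot do. `describe_init()` is a hypothetical function written only for this illustration:

```{r}
describe_init <- function(.init) {
  if (missing(.init)) {
    "not supplied"
  } else if (is.null(.init)) {
    "supplied as NULL"
  } else {
    "supplied"
  }
}

describe_init()
describe_init(NULL)
describe_init("")
```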

The major drawback to this technique is that it makes it look like an argument is required (in direct conflict with @sec-required-no-defaults).
:::

## How do I remediate existing problems?

If you have a function with a long default, you can remediate it with any of the approaches described above.
It won't be a breaking change unless you accidentally change the computation of the default, so make sure you have a test for that before you begin.
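
One way to set up such a check is to keep the old definition around temporarily and compare it with the new one on a handful of inputs. The sketch below uses two hypothetical functions, `pad_before()` and `pad_after()`, and plain `stopifnot()`; in a package you would more likely put the same comparisons in your usual test suite.

```{r}
# Hypothetical "before": the default is computed in the signature.
pad_before <- function(x, width = max(nchar(x)) + 2) {
  formatC(x, width = width)
}

# Hypothetical "after": NULL default, same computation moved into the body.
pad_after <- function(x, width = NULL) {
  if (is.null(width)) {
    width <- max(nchar(x)) + 2
  }
  formatC(x, width = width)
}

# Compare both the default path and an explicit value on a few inputs.
inputs <- list("a", c("a", "bbb"), c("x", "yy", "zzzz"))
for (x in inputs) {
  stopifnot(identical(pad_before(x), pad_after(x)))
  stopifnot(identical(pad_before(x, width = 10), pad_after(x, width = 10)))
}
```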

## See also

-   See @sec-argument-clutter for a technique to simplify your function spec if it's long because it has many less important optional arguments.