28  rlang: Tidy evaluation

Published

February 17, 2023

Modified

January 8, 2024

28.1 rlang Tidy evaluation vignettes

28.2 What is data-masking and why do I need {{?

“Data-masking is a distinctive feature of R whereby programming is performed directly on a data set, with columns defined as normal objects.” This is achieved by defusing R code (quoting it) and then injecting (unquoting) the defused expression in the correct context of the data frame instead of the user environment.

If you pass your function's arguments to a data-masking function such as dplyr::summarise() in the normal way, summarise() defuses the expression exactly as written, but the expressions supplied to your arguments are not injected into it. For instance, my_mean() below does not know to look for cyl and am in the mtcars data frame and instead looks for them in the user environment.

my_mean <- function(data, var1, var2) {
  summarise(data, mean(var1 + var2))
}

my_mean(mtcars, cyl, am)
#> Error in `summarise()`:
#> ℹ In argument: `mean(var1 + var2)`.
#> Caused by error:
#> ! object 'cyl' not found

If you introduce objects named cyl and am into the user environment, those will be used and the mtcars data frame will not be used at all. Notice that the column is named mean(var1 + var2), just like the error message above, indicating the actual code that is being run.

cyl <- 1
am <- 2
my_mean(mtcars, cyl, am)
#>   mean(var1 + var2)
#> 1                 3

To inject a function argument in a data masking context use the embracing syntax curly-curly ({{). Note that when this is done the column is correctly named mean(cyl + am) and the names properly refer to variables in the mtcars data frame and not the user environment.

my_mean <- function(data, var1, var2) {
  summarise(data, mean({{ var1 }} + {{ var2 }}))
}

my_mean(mtcars, cyl, am)
#>   mean(cyl + am)
#> 1        6.59375

28.2.1 What does “masking” mean?

Data masking occurs by placing the data frame at the bottom of the chain of environments so that it takes precedence over the user environment. It thus masks the user environment. This means that data masking functions will use a data frame variable instead of a variable in the user environment as in the above my_mean() function. Tidy eval provides .data and .env pronouns to help deal with this ambiguity.

mtcars  |> 
  summarise(
    mean_data = mean(.data$cyl),
    mean_env = mean(.env$cyl)
  )
#>   mean_data mean_env
#> 1    6.1875        1

28.2.2 How does data-masking work?

Data masking relies on three language features of R:

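  1. Argument defusal with substitute() (base) or enquo() and {{ (rlang): R code is defused so that it can be evaluated later in a special environment.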
  2. First class environments: Environments are a special type of list-like object in which defused R code can be evaluated.
  3. Explicit evaluation with eval() (base) or eval_tidy() (rlang).

The code below brings these three features together: the code is defused (quoted) and then explicitly evaluated with the mtcars data frame as the evaluation context instead of the default user environment.

code <- expr(mean(cyl + am))
eval(code, mtcars)
#> [1] 6.59375
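
The same three steps can also be sketched with rlang. This is a minimal illustration that assumes rlang is installed; with_mask() is a hypothetical helper name used only here. enquo() defuses the argument and eval_tidy() evaluates it with the data frame as the data mask, which also makes the .data and .env pronouns available.

```r
library(rlang)

# Hypothetical helper: defuse `code`, then evaluate it inside `data`
with_mask <- function(data, code) {
  quo <- enquo(code)           # argument defusal
  eval_tidy(quo, data = data)  # explicit evaluation in a data mask
}

masked <- with_mask(mtcars, mean(cyl + am))
masked

# The pronouns separate the cyl column from an environment variable named cyl
cyl <- 1000
both <- with_mask(mtcars, mean(.data$cyl + .env$cyl))
both
```

Because enquo() records the environment in which the expression was written, .env$cyl resolves to the variable defined next to the call rather than to anything inside with_mask().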

28.3 Data mask programming patterns

There are two main considerations when determining which programming pattern should be used to wrap a data-masking function:

  1. What behavior does the wrapped function implement?
  2. What behavior should your function implement?

28.3.1 Argument behaviors

Data masking arguments are not only defined by the type of objects they accept but also the special computational behaviors they exhibit. Options include:

  • Base data-masked expressions (e.g. with()): Expressions may refer to the columns of the supplied data frame.
  • Tidy eval data-masked expressions: Same as base data-masked expressions but with additional features such as the injection operators {{ and !!, and the .data and .env pronouns.
  • Data-masked symbols: Supplied expressions must be simple column names.
  • Tidy selections: Tidy selection is an alternative to data masking and does not involve masking. Expressions are either interpreted in the context of a data frame (c(cyl, am)) or evaluated in the user environment (starts_with()).
  • Dynamic dots: These may be data-masked arguments, tidy selections, or just regular arguments.

You can include documentation about the three main tidy eval options with the following tags:

  • @param foo <[`data-masked`][dplyr::dplyr_data_masking]> What `foo` does.
  • @param bar <[`tidy-select`][dplyr::dplyr_tidy_select]> What `bar` does.
  • @param ... <[`dynamic-dots`][rlang::dyn-dots]> What these dots do.
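
As a rough sketch of how these tags might sit in a roxygen2 block, the example below documents a small wrapper. The function group_means() and its argument names are hypothetical and chosen only for illustration; the tags themselves are the ones listed above.

```r
library(dplyr)

#' Average selected columns within groups
#'
#' @param data A data frame.
#' @param by <[`data-masked`][dplyr::dplyr_data_masking]> Expression to group by.
#' @param cols <[`tidy-select`][dplyr::dplyr_tidy_select]> Columns to average.
group_means <- function(data, by, cols) {
  data %>%
    group_by({{ by }}) %>%                                   # data-masked argument
    summarise(across({{ cols }}, ~ mean(.x, na.rm = TRUE)))  # tidy selection
}

res <- group_means(mtcars, cyl, c(mpg, hp))
res
```

Each argument is documented with the behavior it inherits: by is forwarded to group_by() and cols is forwarded to across().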

28.3.2 Forwarding patterns

Your function inherits the behavior of the function it interfaces with. In both data masking and tidy selection contexts use the embrace operator ({{).

my_summarise <- function(data, var) {
  data %>% dplyr::summarise({{ var }})
}
mtcars %>% my_summarise(mean(cyl))
#>   mean(cyl)
#> 1    6.1875

The behavior of my_summarise() is the same as dplyr::summarise(). This includes the ability to use the .data pronoun to refer to columns. Both calls below work in the same way.

x <- "cyl"
mtcars %>% dplyr::summarise(mean(.data[[x]]))
#>   mean(.data[["cyl"]])
#> 1               6.1875
mtcars %>% my_summarise(mean(.data[[x]]))
#>   mean(.data[["cyl"]])
#> 1               6.1875

Dots can be forwarded by simply passing them on to another function.

my_group_by <- function(.data, ...) {
  .data %>% dplyr::group_by(...)
}

Some tidy selection functions, such as pivot_longer(), take a single named argument instead of .... In that case, pass the ... inside c(), which acts as a selection combinator in this context.

my_pivot_longer <- function(.data, ...) {
  .data %>% tidyr::pivot_longer(c(...))
}
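
Neither wrapper is called above, so here is a short usage sketch. It assumes dplyr and tidyr are installed and repeats the two definitions so that the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Forward dots to a data-masking function
my_group_by <- function(.data, ...) {
  .data %>% group_by(...)
}

# Forward dots to a single tidy-select argument by wrapping them in c()
my_pivot_longer <- function(.data, ...) {
  .data %>% pivot_longer(c(...))
}

grouped <- mtcars %>% my_group_by(cyl, am)
group_vars(grouped)

long <- mtcars %>% my_pivot_longer(cyl, am)
dim(long)
```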

28.3.3 Names patterns

Your function takes strings or character vectors to refer to column names.

The .data pronoun is a tidy eval feature enabled within data-masked arguments and represents the data mask. It can be subset with [[ and $. The three statements below are equivalent.

mtcars %>% dplyr::summarise(mean = mean(cyl))
#>     mean
#> 1 6.1875

mtcars %>% dplyr::summarise(mean = mean(.data$cyl))
#>     mean
#> 1 6.1875

var <- "cyl"
mtcars %>% dplyr::summarise(mean = mean(.data[[var]]))
#>     mean
#> 1 6.1875

You can also use the .data pronoun to connect function arguments to a data-variable. This insulates the function from data-masking behavior. Notice that my_mean() now takes a string. In the second call below, am is an environment variable holding "cyl", so the cyl column is summarised and the am column of mtcars is never consulted.

my_mean <- function(data, var) {
  data %>% dplyr::summarise(mean = mean(.data[[var]]))
}

my_mean(mtcars, "cyl")
#>     mean
#> 1 6.1875

am <- "cyl"
my_mean(mtcars, am)
#>     mean
#> 1 6.1875

.data does not support character vectors of length greater than one. For a character vector containing more than one name, use all_of() or any_of() in a tidy selection.

vars <- c("cyl", "am")
mtcars %>% tidyr::pivot_longer(all_of(vars))
#> # A tibble: 64 × 11
#>      mpg  disp    hp  drat    wt  qsec    vs  gear  carb name  value
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#>  1  21     160   110  3.9   2.62  16.5     0     4     4 cyl       6
#>  2  21     160   110  3.9   2.62  16.5     0     4     4 am        1
#>  3  21     160   110  3.9   2.88  17.0     0     4     4 cyl       6
#>  4  21     160   110  3.9   2.88  17.0     0     4     4 am        1
#>  5  22.8   108    93  3.85  2.32  18.6     1     4     1 cyl       4
#>  6  22.8   108    93  3.85  2.32  18.6     1     4     1 am        1
#>  7  21.4   258   110  3.08  3.22  19.4     1     3     1 cyl       6
#>  8  21.4   258   110  3.08  3.22  19.4     1     3     1 am        0
#>  9  18.7   360   175  3.15  3.44  17.0     0     3     2 cyl       8
#> 10  18.7   360   175  3.15  3.44  17.0     0     3     2 am        0
#> # ℹ 54 more rows
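
The two helpers differ in how they treat names that are not in the data. The sketch below assumes dplyr is installed; the "not_a_column" entry is a made-up name added to trigger the difference.

```r
library(dplyr)

vars <- c("cyl", "am", "not_a_column")

# any_of() silently skips names that are not in the data
lenient <- mtcars %>% select(any_of(vars))
names(lenient)

# all_of() is strict: a missing name is an error, caught here for illustration
strict <- tryCatch(
  mtcars %>% select(all_of(vars)),
  error = function(e) "error: missing column"
)
strict
```

all_of() is the safer default inside functions because a misspelled name fails loudly instead of being dropped.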

28.3.4 Bridge patterns

You change the behavior of an argument instead of inheriting it.

You can use across() or pick() as a bridge between selection and data masking.

my_group_by <- function(data, cols) {
  group_by(data, pick({{ cols }}))
}

mtcars %>% my_group_by(starts_with("c"))
#> # A tibble: 32 × 11
#> # Groups:   cyl, carb [9]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows

pick(), unlike across(), takes dynamic dots, so you can also just pass on the dots. With across() you need to collect the dots with c(...).

my_group_by <- function(.data, ...) {
  group_by(.data, pick(...))
}

mtcars %>% my_group_by(starts_with("c"), vs:gear)
#> # A tibble: 32 × 11
#> # Groups:   cyl, carb, vs, am, gear [15]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
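
For comparison, a sketch of the across() variant mentioned above, where the dots are collected with c(...). The name my_group_by_across() is hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical across() counterpart of the pick() wrapper
my_group_by_across <- function(.data, ...) {
  group_by(.data, across(c(...)))
}

grouped <- mtcars %>% my_group_by_across(starts_with("c"), vs:gear)
group_vars(grouped)
```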

Use all_of() or any_of() to bridge names, or character vectors, to a data mask.

my_group_by <- function(data, vars) {
  data %>% dplyr::group_by(pick(all_of(vars)))
}

mtcars %>% my_group_by(c("cyl", "am"))
#> # A tibble: 32 × 11
#> # Groups:   cyl, am [6]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows

Use mutate(.keep = "none") to bridge from data mask to selection. This pattern is a little trickier: it uses mutate() to evaluate the expressions passed to ... in a data-masking context and to capture the names of the resulting columns. The new or transformed columns are then added back to the data with the splice operator (!!!), and the selection is passed to pivot_longer() through all_of(). In the output, look at the name and value columns on the right. For another way to do this using a symbolize and inject pattern, see Metaprogramming patterns - Bridge patterns.

my_pivot_longer <- function(data, ...) {
  # Forward `...` in data-mask context with `mutate(.keep = "none")`
  # to create a new data frame and save the inputs names
  inputs <- dplyr::mutate(data, ..., .keep = "none")
  names <- names(inputs)
  
  # Update the data with the inputs
  data <- dplyr::mutate(data, !!!inputs)

  # Select the inputs by name with `all_of()`
  tidyr::pivot_longer(data, cols = all_of(names))
}

mtcars %>% my_pivot_longer(cyl, am = am * 100)
#> # A tibble: 64 × 11
#>      mpg  disp    hp  drat    wt  qsec    vs  gear  carb name  value
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#>  1  21     160   110  3.9   2.62  16.5     0     4     4 cyl       6
#>  2  21     160   110  3.9   2.62  16.5     0     4     4 am      100
#>  3  21     160   110  3.9   2.88  17.0     0     4     4 cyl       6
#>  4  21     160   110  3.9   2.88  17.0     0     4     4 am      100
#>  5  22.8   108    93  3.85  2.32  18.6     1     4     1 cyl       4
#>  6  22.8   108    93  3.85  2.32  18.6     1     4     1 am      100
#>  7  21.4   258   110  3.08  3.22  19.4     1     3     1 cyl       6
#>  8  21.4   258   110  3.08  3.22  19.4     1     3     1 am        0
#>  9  18.7   360   175  3.15  3.44  17.0     0     3     2 cyl       8
#> 10  18.7   360   175  3.15  3.44  17.0     0     3     2 am        0
#> # ℹ 54 more rows

28.3.5 Transformation patterns

You can transform inputs with across() by forwarding ... to across() and performing an action on it. This uses ... to inherit tidy selection behavior. For another way to do this using a symbolize and inject pattern, see Metaprogramming patterns - Transformation patterns.

my_mean <- function(data, ...) {
  data %>% dplyr::summarise(
    across(c(...), ~ mean(.x, na.rm = TRUE))
  )
}

mtcars %>% my_mean(cyl, carb)
#>      cyl   carb
#> 1 6.1875 2.8125

mtcars %>% my_mean(foo = cyl, bar = carb)
#>      foo    bar
#> 1 6.1875 2.8125

mtcars %>% my_mean(starts_with("c"), mpg:disp)
#>      cyl   carb      mpg     disp
#> 1 6.1875 2.8125 20.09062 230.7219

filter() requires a different pattern because it is built on logical expressions. if_all() and if_any() are variants of across() suitable for use in filter(). For instance, the function below keeps the rows in which every selected variable differs from its minimum value.

filter_non_baseline <- function(.data, ...) {
  .data %>% dplyr::filter(if_all(c(...), ~ .x != min(.x, na.rm = TRUE)))
}

mtcars %>% filter_non_baseline(vs, am, gear)
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
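
A sketch of the if_any() counterpart, which keeps a row when at least one of the selected variables differs from its minimum. The function name filter_any_non_baseline() is hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical variant: keep rows where ANY selected column is above baseline
filter_any_non_baseline <- function(.data, ...) {
  .data %>% filter(if_any(c(...), ~ .x != min(.x, na.rm = TRUE)))
}

any_rows <- mtcars %>% filter_any_non_baseline(vs, am, gear)
nrow(any_rows)
```

Since if_any() is the less restrictive condition, it returns at least as many rows as the if_all() version.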

28.4 The data mask ambiguity

The convenience of data masking makes it possible to refer to both columns in data frames and objects in the user environment. However, this convenience introduces ambiguity.

For instance, which value of x does the mutate() call below refer to? The problem occurs when you want to use an object from the user environment but there is a column with the same name.

df <- data.frame(x = NA, y = 2)
x <- 100

df %>% dplyr::mutate(y = y / x)
#>    x  y
#> 1 NA NA

Another issue occurs when you have a typo in a data-variable name or you were expecting a column that is missing and there is an object with that name in the user environment. In a data-masking context if a variable cannot be found in the data mask, R looks for variables in the surrounding environment.

df <- data.frame(foo = "right")
ffo <- "wrong"

df %>% dplyr::mutate(foo = toupper(ffo))
#>     foo
#> 1 WRONG

28.4.1 Preventing collisions

The .data and .env pronouns

The easiest solution to disambiguate between data-variables and environment-variables is to use the .data and .env pronouns.

df <- data.frame(x = 1, y = 2)
x <- 100

df %>% dplyr::mutate(y = .data$y / .env$x)
#>   x    y
#> 1 1 0.02
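
The .data pronoun also guards against the typo case shown earlier, because it only looks in the data frame and fails when the column does not exist. A minimal sketch, assuming dplyr is installed and reusing the foo/ffo names from that example:

```r
library(dplyr)

df <- data.frame(foo = "right")
ffo <- "wrong"

# .data$ffo does not fall back to the environment, so the typo is an error
result <- tryCatch(
  df %>% mutate(foo = toupper(.data$ffo)),
  error = function(e) "error: column not found"
)
result
```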

The pronouns are particularly useful inside functions whose arguments might share a name with a column of the data frame. Use the .env pronoun for any environment variable scoped in the function to avoid hitting a masking column. The example below shows how the factor column is given preference over the factor argument in a data-masking context. The function is fixed with the .env pronoun.

df <- data.frame(factor = 0, value = 1)

# Without .env pronoun
my_rescale <- function(data, var, factor = 10) {
  data %>% dplyr::mutate("{{ var }}" := {{ var }} / factor)
}

# Oh no!
df %>% my_rescale(value)
#>   factor value
#> 1      0   Inf

# With .env pronoun to ensure factor argument is used
my_rescale <- function(data, var, factor = 10) {
  data %>% dplyr::mutate("{{ var }}" := {{ var }} / .env$factor)
}

# Yay!
data.frame(factor = 0, value = 1) %>% my_rescale(value)
#>   factor value
#> 1      0   0.1

Subsetting .data with env-variables

The .data[[var]] pattern used to bridge from names to the data mask is insulated from column name collisions. You can only subset the .data pronoun with environment variables, not data variables. [[ works as an injection operator when applied to .data, so var is evaluated before the data mask is created. Below, the var column added to mtcars2 is ignored.

var <- "cyl"

mtcars2 <- mtcars
mtcars2$var <- "wrong"

mtcars2 %>% dplyr::summarise(mean = mean(.data[[var]]))
#>     mean
#> 1 6.1875

Injecting env-variables with !!

As noted above, injection operators modify a piece of code early in the evaluation process before any data-masking logic occurs. “If you inject the value of a variable, it becomes inlined in the expression. R no longer needs to look up any variable to find the value.”

Injection with !! can be used to solve the same problem as using .data and .env pronouns, but the current advice is that it is preferable to use the pronouns instead of the injection operators.

df <- data.frame(x = 1, y = 2)
x <- 100

# .data and .env pronouns
df %>% dplyr::mutate(y = .data$y / .env$x)
#>   x    y
#> 1 1 0.02

# Injection
df %>% dplyr::mutate(y = y / !!x)
#>   x    y
#> 1 1 0.02
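
To see what "inlined" means, you can defuse the same expression with and without injection. This is a small sketch that assumes rlang is installed; the variable names injected and deferred are only illustrative.

```r
library(rlang)

x <- 100

injected <- expr(y / !!x)  # the value of x is inlined into the expression
deferred <- expr(y / x)    # x stays a symbol that must be looked up later

print(injected)
print(deferred)

# The injected expression no longer depends on any variable named x
value <- eval(injected, data.frame(x = 1, y = 2))
value
```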

No ambiguity in tidy selections

“The selection language is designed in such a way that evaluation of expressions is either scoped in the data mask only, or in the environment only.” For instance, in the code below data is a symbol given to the selection operator :. It is scoped in the data mask only and, therefore, refers to the “data” column. ncol(data), in contrast, is evaluated as normal R code. It is an environment expression, so data there refers to the data frame named data in the user environment.

data <- data.frame(x = 1, data = 1:3)

data %>% dplyr::select(data:ncol(data))
#>   data
#> 1    1
#> 2    2
#> 3    3

28.5 The double evaluation problem

A problem with metaprogramming is that the same code can be evaluated multiple times when a defused expression is injected in more than one place within a data-masking context. For instance, a function that computes several summaries of a single column will run the user's expression once per summary if that expression includes a computation (mutate()-like functionality) on the column. The following function seems to work as expected.

summarise_stats <- function(data, var) {
  data %>%
    dplyr::summarise(
      mean = mean({{ var }}),
      sd = sd({{ var }})
    )
}

summarise_stats(mtcars, cyl)
#>     mean       sd
#> 1 6.1875 1.785922

However, if a computation is added to var, that computation will be run on the var column for both the mean() and sd() calculations. Thus, if you multiply cyl by 100, that code is evaluated twice.

summarise_stats(mtcars, cyl * 100)
#>     mean       sd
#> 1 618.75 178.5922

The output is correct, but the code takes longer to evaluate. The code below shows what is actually happening: the defused expression is injected in two places. The caret signs represent quosure boundaries.

dplyr::summarise(
  mean = ^mean(^cyl * 100),
  sd = ^sd(^cyl * 100)
)

We can confirm this by creating a function with a side effect of printing some messages and running it on cyl.

times100 <- function(x) {
  message("Takes a long time...")
  Sys.sleep(0.1)

  message("And causes side effects such as messages!")
  x * 100
}

summarise_stats(mtcars, times100(cyl))
#> Takes a long time...
#> And causes side effects such as messages!
#> Takes a long time...
#> And causes side effects such as messages!
#>     mean       sd
#> 1 618.75 178.5922

The issue of double evaluation can be fixed by ensuring that any computations on var are performed before the summarise() function. This can be done with mutate(.keep = "none").

summarise_stats <- function(data, var) {
  data %>%
    # Evaluate calculations on var
    dplyr::mutate(var = {{ var }}, .keep = "none") %>%
    # Then summarise
    dplyr::summarise(mean = mean(var),
                    sd = sd(var))
}

# Now the defused input is evaluated only once, in mutate()
summarise_stats(mtcars, times100(cyl))
#> Takes a long time...
#> And causes side effects such as messages!
#>     mean       sd
#> 1 618.75 178.5922

28.6 What happens if I use injection operators out of context?

The injection operators {{, !!, and !!! are part of tidy evaluation, not base R. They should therefore only be used in data-masked arguments powered by tidy eval. Outside of tidy eval data masks they have a different meaning.

28.6.1 Using {{ out of context

In R { is like ( but takes multiple expressions instead of one. Wrapping an expression in multiple curly brackets does not do anything special.

# Multiple expressions
list(
  { message("foo"); 2 }
)
#> foo
#> [[1]]
#> [1] 2

{{ 2 }}
#> [1] 2

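Here, the extra braces are at worst a silent no-op. However, an error will typically occur if {{ is used in a base R data mask such as with(), because the argument is not injected and R looks for cyl outside the data frame.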

my_mean <- function(data, var) {
  with(data, mean({{ var }}))
}

my_mean(mtcars, cyl)
#> Error in eval(expr, envir, enclos): object 'cyl' not found
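
In a base R data mask, the defuse-and-evaluate steps have to be written by hand instead. The following is a minimal sketch using only base R; my_mean_base() is a hypothetical name, not the original function.

```r
# Hypothetical base R rewrite of my_mean() without tidy eval
my_mean_base <- function(data, var) {
  var_expr <- substitute(var)                 # defuse the argument
  mean(eval(var_expr, data, parent.frame()))  # evaluate it inside the data
}

res <- my_mean_base(mtcars, cyl)
res

res100 <- my_mean_base(mtcars, cyl * 100)
res100
```

Passing parent.frame() as the enclosure lets the expression still find objects from the caller's environment when a name is not a column.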

28.6.2 Using !! and !!! out of context

!! and !!! are interpreted as double and triple negation in regular R code.

!! TRUE
#> [1] TRUE
!!! TRUE
#> [1] FALSE