28 rlang: Tidy evaluation
28.1 rlang
Tidy evaluation vignettes
28.2 What is data-masking and why do I need {{?
“Data-masking is a distinctive feature of R whereby programming is performed directly on a data set, with columns defined as normal objects.” This is achieved by defusing R code (quoting it) and then injecting (unquoting) the defused expression in the correct context of the data frame instead of the user environment.
If you pass arguments through to a data-masking function such as dplyr::summarise() in the normal way, the arguments are defused, but the user-supplied arguments are not injected. For instance, below my_mean() does not know to look for cyl and am in the mtcars data frame and instead looks for them in the user environment.
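As a rough sketch of the problem, assume my_mean() is a thin wrapper around dplyr::summarise() that passes its arguments on without embracing them. The definition below is an illustrative assumption, chosen to be consistent with the column name mean(var1 + var2) mentioned in this section.

```r
# Hypothetical definition: the arguments are passed on "normally",
# without being embraced.
my_mean <- function(data, var1, var2) {
  dplyr::summarise(data, mean(var1 + var2))
}

# `cyl` and `am` are looked up in the calling environment, not in `mtcars`.
# When no such objects exist there, the call fails with an error.
result <- tryCatch(
  my_mean(mtcars, cyl, am),
  error = function(e) "error: object not found"
)
result
```

The tryCatch() wrapper is only there so that the sketch runs to completion; called directly, my_mean(mtcars, cyl, am) would stop with an error.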
If you introduce objects named cyl and am into the user environment, those will be used and the mtcars data frame will not be used at all. Notice that the column is named mean(var1 + var2), just like the error message above, indicating the actual code that is being run.
cyl <- 1
am <- 2
my_mean(mtcars, cyl, am)
#> mean(var1 + var2)
#> 1 3
To inject a function argument in a data-masking context, use the embracing syntax, curly-curly ({{). Note that when this is done the column is correctly named mean(cyl + am) and the names properly refer to variables in the mtcars data frame and not the user environment.
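A minimal sketch of the embraced version, again assuming my_mean() is a two-argument wrapper around dplyr::summarise() (the definition is illustrative):

```r
# Illustrative definition: each argument is embraced with {{ }} so that it
# is injected into the data-masked expression.
my_mean <- function(data, var1, var2) {
  dplyr::summarise(data, mean({{ var1 }} + {{ var2 }}))
}

# `cyl` and `am` now refer to columns of `mtcars`.
res <- my_mean(mtcars, cyl, am)
res
```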
28.2.1 What does “masking” mean?
Data masking occurs by placing the data frame at the bottom of the chain of environments so that it takes precedence over the user environment. It thus masks the user environment. This means that data-masking functions will use a data frame variable instead of a variable in the user environment, as in the above my_mean() function. Tidy eval provides the .data and .env pronouns to help deal with this ambiguity.
28.2.2 How does data-masking work?
Data masking relies on three language features of R:
- Argument defusal
- First class environments: Environments are a special type of list-like object in which defused R code can be evaluated.
- Explicit evaluation with eval() (base) or eval_tidy() (rlang).
The code below brings these three features together: the code is defused (quoted) and then explicitly evaluated within the environment of the mtcars data frame instead of the default user environment.
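One way to sketch the three features together uses base quote() for defusal; the variable names code, base_result, and tidy_result are illustrative.

```r
# Defuse (quote) the code instead of evaluating it right away
code <- quote(mean(cyl + am))

# Evaluate it explicitly, with the data frame acting as the environment
base_result <- eval(code, mtcars)

# The rlang equivalent, which additionally supports tidy eval features
tidy_result <- rlang::eval_tidy(code, data = mtcars)

base_result
```

Both calls look up cyl and am in mtcars first, which is the masking behavior described above.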
28.3 Data mask programming patterns
There are two main considerations when determining which programming pattern should be used to wrap a data-masking function:
- What behavior does the wrapped function implement?
- What behavior should your function implement?
28.3.1 Argument behaviors
Data masking arguments are not only defined by the type of objects they accept but also the special computational behaviors they exhibit. Options include:
- Base data-masked expressions (e.g. with()): Expressions may refer to the columns of the supplied data frame.
- Tidy eval data-masked expressions: Same as base data-masked expressions but with additional features such as the injection operators {{ and !! and the .data and .env pronouns.
- Data-masked symbols: Supplied expressions must be simple column names.
- Tidy selections: Tidy selection is an alternative to data masking and does not involve masking. Expressions are either interpreted in the context of a data frame (c(cyl, am)) or evaluated in the user environment (starts_with()).
- Dynamic dots: These may be data-masked arguments, tidy selections, or just regular arguments.
You can include documentation about the three main tidy eval options with the following tags:
@param foo <[`data-masked`][dplyr::dplyr_data_masking]> What `foo` does.
@param bar <[`tidy-select`][dplyr::dplyr_tidy_select]> What `bar` does.
@param ... <[`dynamic-dots`][rlang::dyn-dots]> What these dots do.
28.3.2 Forwarding patterns
Your function inherits the behavior of the function it interfaces with. In both data-masking and tidy selection contexts, use the embrace operator ({{).
The behavior of my_summarise() is the same as dplyr::summarise(). This includes the ability to use the .data pronoun to refer to columns. Referring to a column by its bare name and through the .data pronoun both work in the same way.
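A minimal sketch of such a forwarding wrapper, assuming my_summarise() takes a single embraced argument (the definition and the names res_bare and res_pronoun are illustrative):

```r
library(dplyr)

# Illustrative wrapper: `var` is forwarded to summarise() with {{ }}
my_summarise <- function(data, var) {
  data %>% dplyr::summarise({{ var }})
}

# Refer to the column directly ...
res_bare <- mtcars %>% my_summarise(mean(cyl))

# ... or through the .data pronoun and a string
x <- "cyl"
res_pronoun <- mtcars %>% my_summarise(mean(.data[[x]]))
```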
Dots can be forwarded by simply passing them on to another argument.
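As a short sketch of dots forwarding, here is a hypothetical my_group_by() wrapper that hands its dots straight to dplyr::group_by():

```r
library(dplyr)

# Illustrative wrapper: the dots are passed straight through to group_by()
my_group_by <- function(.data, ...) {
  .data %>% dplyr::group_by(...)
}

grouped <- mtcars %>% my_group_by(cyl, am)
dplyr::group_vars(grouped)
```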
Some tidy selection functions, such as pivot_longer(), use a single named argument instead of .... In that case, pass the ... inside c(), which acts as a selection combinator in this context.
my_pivot_longer <- function(.data, ...) {
  .data %>% tidyr::pivot_longer(c(...))
}
28.3.3 Names patterns
Your function takes strings or character vectors to refer to column names.
The .data pronoun is a tidy eval feature enabled within data-masked arguments and represents the data mask. It can be subset with [[ and $. A bare column name, .data$name, and .data[["name"]] are equivalent ways to refer to a column, just as above with my_summarise().
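The equivalence can be sketched with three summarise() calls on the cyl column; the result names are illustrative.

```r
library(dplyr)

# Bare column name
by_name <- mtcars %>% dplyr::summarise(mean = mean(cyl))

# .data pronoun subset with $
by_dollar <- mtcars %>% dplyr::summarise(mean = mean(.data$cyl))

# .data pronoun subset with [[ and a string
by_bracket <- mtcars %>% dplyr::summarise(mean = mean(.data[["cyl"]]))
```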
You can also use the .data pronoun to connect function arguments to a data-variable. This insulates the function from data-masking behavior. Notice that my_mean() now needs a character vector: if an environment variable am holds the string "cyl", passing am uses that string rather than the data-variable am.
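A sketch of this names pattern, assuming my_mean() takes a data frame and a single column name as a string (the definition is illustrative):

```r
library(dplyr)

# Illustrative definition: `var` is a string used to subset .data
my_mean <- function(data, var) {
  data %>% dplyr::summarise(mean = mean(.data[[var]]))
}

# Pass the column name as a string
res_string <- my_mean(mtcars, "cyl")

# `am` is an ordinary environment variable here; its value "cyl" is used,
# not the `am` column of mtcars
am <- "cyl"
res_env <- my_mean(mtcars, am)
```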
.data does not support character vectors of length greater than one. For character vectors containing more than one name, use all_of() or any_of().
vars <- c("cyl", "am")
mtcars %>% tidyr::pivot_longer(all_of(vars))
#> # A tibble: 64 × 11
#> mpg disp hp drat wt qsec vs gear carb name value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 21 160 110 3.9 2.62 16.5 0 4 4 cyl 6
#> 2 21 160 110 3.9 2.62 16.5 0 4 4 am 1
#> 3 21 160 110 3.9 2.88 17.0 0 4 4 cyl 6
#> 4 21 160 110 3.9 2.88 17.0 0 4 4 am 1
#> 5 22.8 108 93 3.85 2.32 18.6 1 4 1 cyl 4
#> 6 22.8 108 93 3.85 2.32 18.6 1 4 1 am 1
#> 7 21.4 258 110 3.08 3.22 19.4 1 3 1 cyl 6
#> 8 21.4 258 110 3.08 3.22 19.4 1 3 1 am 0
#> 9 18.7 360 175 3.15 3.44 17.0 0 3 2 cyl 8
#> 10 18.7 360 175 3.15 3.44 17.0 0 3 2 am 0
#> # ℹ 54 more rows
28.3.4 Bridge patterns
You change the behavior of an argument instead of inheriting it.
You can use across() or pick() as a bridge between selection and data masking.
my_group_by <- function(data, cols) {
  group_by(data, pick({{ cols }}))
}
mtcars %>% my_group_by(starts_with("c"))
#> # A tibble: 32 × 11
#> # Groups: cyl, carb [9]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
pick(), unlike across(), takes dynamic dots, so you can also just pass on the dots. With across() you need to collect the dots with c(...).
my_group_by <- function(.data, ...) {
  group_by(.data, pick(...))
}
mtcars %>% my_group_by(starts_with("c"), vs:gear)
#> # A tibble: 32 × 11
#> # Groups: cyl, carb, vs, am, gear [15]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
Use all_of() or any_of() to bridge names, or character vectors, to a data mask.
my_group_by <- function(data, vars) {
  data %>% dplyr::group_by(pick(all_of(vars)))
}
mtcars %>% my_group_by(c("cyl", "am"))
#> # A tibble: 32 × 11
#> # Groups: cyl, am [6]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
Use mutate(.keep = "none") to bridge from data mask to selection. This pattern is a little trickier and uses mutate() as a way to inspect the names passed to ... and make sure that they are included in the data frame. The resulting columns, including any transformations, are added to the data with the splice operator (!!!), and the selection can then be passed to pivot_longer() through all_of(). For the output, look to the columns on the right that show name and value. For another way to do this using a symbolize and inject pattern, see Metaprogramming patterns - Bridge patterns.
my_pivot_longer <- function(data, ...) {
  # Forward `...` in data-mask context with `mutate(.keep = "none")`
  # to create a new data frame and save the input names
  inputs <- dplyr::mutate(data, ..., .keep = "none")
  names <- names(inputs)
  # Update the data with the inputs
  data <- dplyr::mutate(data, !!!inputs)
  # Select the inputs by name with `all_of()`
  tidyr::pivot_longer(data, cols = all_of(names))
}
mtcars %>% my_pivot_longer(cyl, am = am * 100)
#> # A tibble: 64 × 11
#> mpg disp hp drat wt qsec vs gear carb name value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 21 160 110 3.9 2.62 16.5 0 4 4 cyl 6
#> 2 21 160 110 3.9 2.62 16.5 0 4 4 am 100
#> 3 21 160 110 3.9 2.88 17.0 0 4 4 cyl 6
#> 4 21 160 110 3.9 2.88 17.0 0 4 4 am 100
#> 5 22.8 108 93 3.85 2.32 18.6 1 4 1 cyl 4
#> 6 22.8 108 93 3.85 2.32 18.6 1 4 1 am 100
#> 7 21.4 258 110 3.08 3.22 19.4 1 3 1 cyl 6
#> 8 21.4 258 110 3.08 3.22 19.4 1 3 1 am 0
#> 9 18.7 360 175 3.15 3.44 17.0 0 3 2 cyl 8
#> 10 18.7 360 175 3.15 3.44 17.0 0 3 2 am 0
#> # ℹ 54 more rows
28.3.5 Transformation patterns
You can transform inputs with across() by forwarding ... to across() and performing an action on it. This uses ... to inherit tidy selection behavior. For another way to do this using a symbolize and inject pattern, see Metaprogramming patterns - Transformation patterns.
my_mean <- function(data, ...) {
  data %>% dplyr::summarise(
    across(c(...), ~ mean(.x, na.rm = TRUE))
  )
}
mtcars %>% my_mean(cyl, carb)
#> cyl carb
#> 1 6.1875 2.8125
mtcars %>% my_mean(foo = cyl, bar = carb)
#> foo bar
#> 1 6.1875 2.8125
mtcars %>% my_mean(starts_with("c"), mpg:disp)
#> cyl carb mpg disp
#> 1 6.1875 2.8125 20.09062 230.7219
filter() necessitates a different pattern because it is built on logical expressions. if_all() and if_any() provide variants of across() suitable for use in filter(). For instance, the function below keeps only the rows in which every selected variable differs from its minimum value.
filter_non_baseline <- function(.data, ...) {
  .data %>% dplyr::filter(if_all(c(...), ~ .x != min(.x, na.rm = TRUE)))
}
mtcars %>% filter_non_baseline(vs, am, gear)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
28.4 The data mask ambiguity
The convenience of data masking makes it possible to refer to both columns in data frames and objects in the user environment. However, this convenience introduces ambiguity.
For instance, which value of x is being referred to in the mutate() call below? The problem occurs when you want to use an object from the user environment but there is a column with the same name.
df <- data.frame(x = NA, y = 2)
x <- 100
df %>% dplyr::mutate(y = y / x)
#> x y
#> 1 NA NA
Another issue occurs when you have a typo in a data-variable name or you were expecting a column that is missing and there is an object with that name in the user environment. In a data-masking context if a variable cannot be found in the data mask, R looks for variables in the surrounding environment.
df <- data.frame(foo = "right")
ffo <- "wrong"
df %>% dplyr::mutate(foo = toupper(ffo))
#> foo
#> 1 WRONG
28.4.1 Preventing collisions
The .data and .env pronouns
The easiest way to disambiguate between data-variables and environment-variables is to use the .data and .env pronouns.
df <- data.frame(x = 1, y = 2)
x <- 100
df %>% dplyr::mutate(y = .data$y / .env$x)
#> x y
#> 1 1 0.02
This is particularly useful inside a function whose named arguments might conflict with column names in the data frame. Use the .env pronoun for any environment variables scoped in the function to avoid hitting a masking column. The example below shows how the factor column is given preference over the factor argument in a data-masking context. The function is fixed with the .env pronoun.
df <- data.frame(factor = 0, value = 1)
# Without .env pronoun
my_rescale <- function(data, var, factor = 10) {
  data %>% dplyr::mutate("{{ var }}" := {{ var }} / factor)
}
# Oh no!
df %>% my_rescale(value)
#> factor value
#> 1 0 Inf
# With .env pronoun to ensure factor argument is used
my_rescale <- function(data, var, factor = 10) {
  data %>% dplyr::mutate("{{ var }}" := {{ var }} / .env$factor)
}
# Yay!
data.frame(factor = 0, value = 1) %>% my_rescale(value)
#> factor value
#> 1 0 0.1
Subsetting .data with env-variables
The .data[[var]] pattern, used to bridge from a name to the data mask, is insulated from column name collisions. You can only subset the .data pronoun with environment variables, not data variables. [[ works as an injection operator when applied to .data and so is evaluated before the data mask is created.
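A small sketch of this insulation: the data frame below is given a column that is also called var, yet the subscript of .data[[var]] is still taken from the environment. The mtcars2 copy and its extra column are illustrative.

```r
library(dplyr)

# Environment variable holding the column name to use
var <- "cyl"

# A data frame that also contains a column called `var`
mtcars2 <- mtcars
mtcars2$var <- "wrong"

# The subscript `var` is resolved in the environment, so the `cyl`
# column is summarised rather than the `var` column
res <- mtcars2 %>% dplyr::summarise(mean = mean(.data[[var]]))
res
```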
Injecting env-variables with !!
As noted above, injection operators modify a piece of code early in the evaluation process before any data-masking logic occurs. “If you inject the value of a variable, it becomes inlined in the expression. R no longer needs to look up any variable to find the value.”
Injection with !! can be used to solve the same problem as the .data and .env pronouns, but the current advice is to prefer the pronouns over the injection operators.
df <- data.frame(x = 1, y = 2)
x <- 100
# .data and .env pronouns
df %>% dplyr::mutate(y = .data$y / .env$x)
#> x y
#> 1 1 0.02
# Injection
df %>% dplyr::mutate(y = y / !!x)
#> x y
#> 1 1 0.02
No ambiguity in tidy selections
“The selection language is designed in such a way that evaluation of expressions is either scoped in the data mask only, or in the environment only.” For instance, in the code below data is a symbol given to the selection operator :. It is scoped in the data mask only and, therefore, refers to the “data” column. ncol(data), on the other hand, is evaluated as normal R code: it is an environment expression in which data refers to the data frame named data in the user environment.
data <- data.frame(x = 1, data = 1:3)
data %>% dplyr::select(data:ncol(data))
#> data
#> 1 1
#> 2 2
#> 3 3
28.5 The double evaluation problem
A problem with metaprogramming is that the same code can be evaluated multiple times when a defused expression is injected in several places within a data-masking context. For instance, a function that computes several summaries of a single column will run the user's expression once per summary if that expression contains a computation (mutate()-like functionality) on the column. The following function seems to work as expected.
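A sketch of such a function, assuming summarise_stats() embraces the same argument in two summaries (the definition is illustrative, chosen to match the mean and sd columns shown in this section):

```r
library(dplyr)

# Illustrative definition: `var` is injected in two separate places
summarise_stats <- function(data, var) {
  data %>%
    dplyr::summarise(
      mean = mean({{ var }}),
      sd = sd({{ var }})
    )
}

stats <- summarise_stats(mtcars, cyl)
stats
```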
However, if a computation is added to var, that computation will be run on the var column for both the mean() and sd() calculations. Thus, if you multiply cyl by 100, that code is evaluated twice.
summarise_stats(mtcars, cyl * 100)
#> mean sd
#> 1 618.75 178.5922
The output is correct, but the code will take longer to evaluate. Below shows what is actually happening in the code because a defused expression is injected in two places. The caret signs represent quosure boundaries.
dplyr::summarise(
  mean = ^mean(^cyl * 100),
  sd = ^sd(^cyl * 100)
)
We can confirm this by creating a function with the side effect of printing some messages and running it on cyl.
times100 <- function(x) {
  message("Takes a long time...")
  Sys.sleep(0.1)
  message("And causes side effects such as messages!")
  x * 100
}
summarise_stats(mtcars, times100(cyl))
#> Takes a long time...
#> And causes side effects such as messages!
#> Takes a long time...
#> And causes side effects such as messages!
#> mean sd
#> 1 618.75 178.5922
The issue of double evaluation can be fixed by ensuring that any computations on var are performed before the summarise() call. This can be done with mutate(.keep = "none").
summarise_stats <- function(data, var) {
  data %>%
    # Evaluate calculations on var
    dplyr::mutate(var = {{ var }}, .keep = "none") %>%
    # Then summarise
    dplyr::summarise(
      mean = mean(var),
      sd = sd(var)
    )
}
# Now the defused input is only evaluated the one time in mutate
summarise_stats(mtcars, times100(cyl))
#> Takes a long time...
#> And causes side effects such as messages!
#> mean sd
#> 1 618.75 178.5922
28.6 What happens if I use injection operators out of context?
Injection operators {{, !!, and !!! are part of tidy evaluation and not part of base R. They should therefore only be used in data-masked arguments powered by tidy eval. Outside of the context of tidy eval data masks they have a different meaning.
28.6.1 Using {{ out of context
In R, { is like ( but takes multiple expressions instead of one. Wrapping an expression in multiple curly brackets does not do anything special.
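A quick sketch in plain R shows that doubled braces are simply nested braces; the names single and double are illustrative.

```r
# One pair of braces returns the value of its last expression
single <- { 1 }

# Two pairs behave the same way: the outer braces just wrap the inner ones
double <- {{ 1 }}

identical(single, double)
```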
Here, the result is at worst a silent error. However, an error will occur if {{ is used in a base R data mask.
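As a sketch of the base R case, consider a hypothetical my_mean() built on with(), which is a base data mask and does not understand {{:

```r
# Illustrative wrapper around a base R data mask
my_mean <- function(data, var) {
  with(data, mean({{ var }}))
}

# `{{` has no special meaning here, so `var` is evaluated as a regular
# argument in the calling environment, where no `cyl` object exists
outcome <- tryCatch(
  my_mean(mtcars, cyl),
  error = function(e) "error"
)
outcome
```

The tryCatch() wrapper only keeps the sketch from stopping; called directly, the function fails because cyl is never looked up in mtcars.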
28.6.2 Using !! and !!! out of context
!! and !!! are interpreted as double and triple negation in regular R code.
!! TRUE
#> [1] TRUE
!!! TRUE
#> [1] FALSE