26 Programming with dplyr
Programming with dplyr vignette
This vignette covers tidy evaluation. The language used to describe tidy evaluation changed substantially with the release of dplyr 1.0.0 on 29 May 2020. The change built on rlang 0.4.0 (June 2019), which introduced {{ }} (curly-curly), and it was consolidated when the rlang tidy evaluation and metaprogramming vignettes were rewritten for rlang 1.0.0, released 26 January 2022.
This vignette provides a basic overview of the main user-facing features of tidy evaluation and the new nomenclature used for tidy evaluation. For the previous language, see Wickham, Advanced R - Metaprogramming and the Programming with dplyr vignette before 1.0.0.
The vignette divides the concept of tidy evaluation into two main parts: data masking and tidy selection. Data masking allows you to “use data variables as if they were variables in the environment.” Tidy selection makes it “so you can easily choose variables based on their position, name, or type.”
26.1 Data masking
Data masking allows you to refer to variables in data frames (data-variables) as if they were objects in your R environment (env-variables). This blurring of the meaning of “variable” is useful within interactive data analysis, but it introduces problems when programming with these tools.
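To make the distinction concrete, here is a minimal sketch. The tibble df and the objects x and n are made up for illustration; the .data and .env pronouns come from rlang and are available inside dplyr verbs.

```r
library(dplyr)

df <- tibble(x = 1:3)   # x is a data-variable (a column of df)
x <- 100                # an env-variable that happens to share the name
n <- 10                 # an env-variable with no matching column

# Data masking looks in the data frame first and then in the environment,
# so x resolves to the column and n resolves to the env-variable.
masked <- df %>% mutate(y = x * n)

# The pronouns state the intent explicitly and remove the ambiguity.
explicit <- df %>% mutate(y = .data$x * .env$n)
from_env <- df %>% mutate(y = .data$x + .env$x)
```

Inside a function, the same ambiguity is what makes it unclear whether a name refers to a column or to an argument, which is the problem the rest of this section addresses.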
26.1.1 Indirection
The concept of indirection is a replacement for the language of quasiquotation. Indirection occurs “when you want to get the data-variable from an env-variable instead of directly typing the data-variable’s name.” There are two main cases:
- Data-variable in a function argument: embrace the argument with curly-curly ({{ }}).
- Env-variable that is a character vector: index into the .data pronoun with [[.
.data is not a data frame but a pronoun that provides access to the current variables, either by referring directly to the column with .data$x or indirectly through a character vector with .data[[var]].
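The two cases can be sketched side by side. The helper names range_of() and mean_of() are hypothetical and exist only for this illustration; mtcars is the built-in dataset.

```r
library(dplyr)

# Case 1: the caller passes a data-variable, so the function embraces it.
# range_of() is a hypothetical helper.
range_of <- function(data, var) {
  data %>% summarise(min = min({{ var }}), max = max({{ var }}))
}
res_embrace <- range_of(mtcars, mpg)

# Case 2: the column name arrives as a character vector, so the function
# indexes into the .data pronoun. mean_of() is a hypothetical helper.
mean_of <- function(data, var_name) {
  data %>% summarise(mean = mean(.data[[var_name]]))
}
col <- "mpg"
res_pronoun <- mean_of(mtcars, col)
```

Embracing keeps the interactive feel for callers (they type mpg, not "mpg"), while .data[[var]] suits cases where the column name is computed or read from elsewhere.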
26.1.2 Name injection
Name injection builds on dynamic dots, which make it possible to generate names programmatically by using := instead of =. There are two forms of name injection:
- If the name is an env-variable, use glue syntax.
library(dplyr)  # provides tibble(), %>% and starwars for the examples below

name <- "susan"
tibble("{name}" := 2)
#> # A tibble: 1 × 1
#>   susan
#>   <dbl>
#> 1     2
- If the name is derived from a data-variable in an argument, use embracing syntax.
my_df <- function(x) {
  tibble("{{x}}_2" := x * 2)
}
y <- 10
my_df(y)
#> # A tibble: 1 × 1
#>     y_2
#>   <dbl>
#> 1    20
26.2 Tidy selection
The capabilities of tidy selection are based on the tidyselect package. Tidy selection provides a domain-specific language to select columns by name, position, or type.
26.2.1 Indirection
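Indirection with tidy selection occurs when the column selection is stored in an intermediate variable. There are two main cases:
- Data-variable in a function argument: embrace the argument with curly-curly ({{ }}).
- Env-variable that is a character vector: use all_of() or any_of(), depending on whether a missing variable should raise an error.
The following function illustrates the first case: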
summarise_mean <- function(data, vars) {
  data %>% summarise(n = n(), across({{ vars }}, mean))
}
mtcars %>%
  group_by(cyl) %>%
  summarise_mean(where(is.numeric))
#> # A tibble: 3 × 12
#>     cyl     n   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4    11  26.7  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
#> 2     6     7  19.7  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
#> 3     8    14  15.1  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5
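When the selection is instead held in an env-variable as a character vector, tidyselect's all_of() and any_of() helpers can be used. The following is a small sketch with mtcars; the vector vars and the deliberately wrong name "not_a_column" are made up for illustration.

```r
library(dplyr)

vars <- c("mpg", "vs")

# all_of() is strict: every name in the vector must exist in the data.
strict <- mtcars %>% select(all_of(vars))

# any_of() is lenient: names that do not exist are silently skipped.
lenient <- mtcars %>% select(any_of(c("mpg", "not_a_column")))

# With all_of(), a missing column is an error rather than a silent drop.
failed <- inherits(
  try(select(mtcars, all_of("not_a_column")), silent = TRUE),
  "try-error"
)
```

Choosing between the two is a question of whether a missing column signals a bug (all_of()) or an expected variation in the input (any_of()).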
26.3 How-tos
26.3.1 Eliminating R CMD check NOTEs
If you are writing a package and have a function that uses data masking or tidy selection, the variables used within the function will lead to a NOTE about undefined global variables. There are two ways to eliminate this NOTE, depending on whether it derives from data masking or tidy selection.
- Data masking: use .data$var and import .data from its source in the rlang package.
- Tidy selection: use "var" instead of var.
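A sketch of how both fixes might look in a package function. The function high_mpg() and its cutoff argument are hypothetical; the roxygen2 tag is shown as a comment and only takes effect inside a package that uses roxygen2.

```r
library(dplyr)

# In a package, this roxygen2 tag imports the pronoun so that R CMD check
# can resolve `.data`:
#' @importFrom rlang .data
high_mpg <- function(df, cutoff = 25) {
  df %>%
    # Data masking: refer to the column through the .data pronoun.
    filter(.data$mpg > cutoff) %>%
    # Tidy selection: refer to columns with strings instead of bare names.
    select("mpg", "cyl")
}

res <- high_mpg(mtcars)
```

Neither mpg nor cyl now appears as a bare symbol in the function body, so the check has no undefined global variable to report.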
26.3.2 User-supplied expressions in function arguments
Use embracing to capture and inject the expression into the function.
To use the name of the variable in the output, embrace the variable with {{ inside the string on the left-hand side of :=.
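A minimal sketch of both ideas together, in the spirit of the vignette's examples. The function name summarise_var() is hypothetical.

```r
library(dplyr)

# summarise_var() is a hypothetical helper: var is embraced on the right-hand
# side to inject the expression, and inside the string on the left-hand side
# of := to build the output column names.
summarise_var <- function(data, var) {
  data %>% summarise(
    n = n(),
    "mean_{{var}}" := mean({{ var }}),
    "sum_{{var}}" := sum({{ var }})
  )
}

res <- summarise_var(mtcars, mpg)
```

Called with mpg, the result has the columns n, mean_mpg and sum_mpg.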
26.3.3 Any number of user-supplied expressions
Use ... to take any number of user-specified expressions. When using ..., all named arguments should begin with . to minimize the chance of argument clashes. See the tidyverse design guide for details.
my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE),
              height = mean(height, na.rm = TRUE))
}
starwars %>% my_summarise(homeworld, gender)
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 57 × 4
#> # Groups:   homeworld [49]
#>    homeworld      gender     mass height
#>    <chr>          <chr>     <dbl>  <dbl>
#>  1 Alderaan       feminine     49   150
#>  2 Alderaan       masculine    79   190.
#>  3 Aleen Minor    masculine    15    79
#>  4 Bespin         masculine    79   175
#>  5 Bestine IV     <NA>        110   180
#>  6 Cato Neimoidia masculine    90   191
#>  7 Cerea          masculine    82   198
#>  8 Champala       masculine   NaN   196
#>  9 Chandrila      feminine    NaN   150
#> 10 Concord Dawn   masculine    79   183
#> # ℹ 47 more rows
26.3.4 Transforming user-supplied variables
Use across() and pick() (new with dplyr 1.1.0) to transform sets of data variables. You can also use the .names argument to across() to control the names of the output columns.
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(pick({{ group_var }})) %>%
    summarise(across({{ summarise_var }},
                     ~ mean(., na.rm = TRUE),
                     .names = "mean_{.col}"))
}
my_summarise(starwars,
             group_var = c(species, gender),
             summarise_var = c(mass, height))
#> `summarise()` has grouped output by 'species'. You can override using the
#> `.groups` argument.
#> # A tibble: 42 × 4
#> # Groups:   species [38]
#>    species   gender    mean_mass mean_height
#>    <chr>     <chr>         <dbl>       <dbl>
#>  1 Aleena    masculine      15            79
#>  2 Besalisk  masculine     102           198
#>  3 Cerean    masculine      82           198
#>  4 Chagrian  masculine     NaN           196
#>  5 Clawdite  feminine       55           168
#>  6 Droid     feminine      NaN            96
#>  7 Droid     masculine      69.8         140
#>  8 Dug       masculine      40           112
#>  9 Ewok      masculine      20            88
#> 10 Geonosian masculine      80           183
#> # ℹ 32 more rows