26  Programming with dplyr

Published

February 15, 2023

Modified

January 8, 2024

Programming with dplyr vignette

This vignette covers tidy evaluation. The language used to describe tidy evaluation changed substantially with the release of dplyr 1.0.0 on 29 May 2020. The change built on rlang 0.4.0 (June 2019), which introduced {{ }} (curly-curly), and was consolidated when the rlang tidy evaluation and metaprogramming vignettes were rewritten for rlang 1.0.0, released 26 January 2022.

This vignette provides a basic overview of the main user-facing features of tidy evaluation and the new nomenclature used for tidy evaluation. For the previous language see Wickham, Advanced R - Metaprogramming and the Programming with dplyr vignette before 1.0.0.

The vignette divides the concept of tidy evaluation into two main parts: data masking and tidy selection. Data masking allows you to “use data variables as if they were variables in the environment.” Tidy selection makes it “so you can easily choose variables based on their position, name, or type.”

26.1 Data masking

Data masking allows you to refer to variables in data frames (data-variables) as if they were objects in your R environment (env-variables). This blurring of the meaning of “variable” is useful within interactive data analysis, but it introduces problems when programming with these tools.
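
The ambiguity described above can be sketched with a small example. The tibble `df` and the env-variables `x` and `n` are made up for illustration; the `.data` and `.env` pronouns are used to state explicitly where each name should be looked up.

```r
library(dplyr)

df <- tibble(x = c(1, 2, 3))
x <- 100  # env-variable that shares its name with a data-variable
n <- 2    # env-variable with no matching column

# Inside the data mask a bare `x` resolves to the column first;
# `n` is not a column, so it is found in the environment.
masked <- df %>% mutate(y = x * n)

# The pronouns remove the ambiguity: column x times env-variable x.
explicit <- df %>% mutate(y = .data$x * .env$x)
```

When a function is meant to be reused, spelling out the pronouns in this way is one option for avoiding surprises when a column and an environment object share a name.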

26.1.1 Indirection

The concept of indirection is a replacement for the language of quasiquotation. Indirection occurs “when you want to get the data-variable from an env-variable instead of directly typing the data-variable’s name.” There are two main cases:

  1. Data-variable in a function argument: Need to embrace the argument with curly-curly ({{).
library(dplyr)

# `var` is embraced so that callers can pass a bare column name.
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
  2. Environment-variable that is a character vector: Need to index into the .data pronoun.

.data is not a data frame but a pronoun that provides access to current variables by either referring directly to the column with .data$x or indirectly through a character vector with .data[[var]].

for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}
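
As a rough side-by-side of the two cases, the sketch below applies both forms of indirection to the same summary. `var_summary_chr()` is a hypothetical helper that is not part of the vignette; it takes the column name as a string instead of a bare name.

```r
library(dplyr)

# Case 1: the column arrives as a bare name, so the argument is embraced.
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
res_embrace <- mtcars %>% var_summary(mpg)

# Case 2 (hypothetical helper): the column arrives as a string,
# so it is looked up through the .data pronoun.
var_summary_chr <- function(data, var) {
  data %>%
    summarise(n = n(), min = min(.data[[var]]), max = max(.data[[var]]))
}
res_chr <- mtcars %>% var_summary_chr("mpg")
```

Both calls should produce the same one-row summary; which form to prefer depends on whether callers are expected to type column names or to hold them in character vectors.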

26.1.2 Name injection

Name injection is a feature of dynamic dots that makes it possible to generate names programmatically with :=. There are two forms of name injection:

  1. If the name is an env-variable, use glue syntax.
name <- "susan"
tibble("{name}" := 2)
#> # A tibble: 1 × 1
#>   susan
#>   <dbl>
#> 1     2
  2. If the name is derived from a data-variable in an argument, use embracing syntax.
my_df <- function(x) {
  tibble("{{x}}_2" := x * 2)
}
y <- 10
my_df(y)
#> # A tibble: 1 × 1
#>     y_2
#>   <dbl>
#> 1    20
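
The two forms can also be combined in a single name. The following is a minimal sketch; `scale_col()` and its `prefix` argument are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical helper: the new column name is built from an env-variable
# (`prefix`, glue syntax) and the label of an embraced argument (`var`).
scale_col <- function(data, var, prefix = "scaled") {
  data %>%
    mutate("{prefix}_{{ var }}" := {{ var }} / max({{ var }}))
}

out <- scale_col(head(mtcars, 3), mpg)
names(out)  # the last name is built from `prefix` and the argument label
```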

26.2 Tidy selection

The capabilities of tidy selection are based on the tidyselect package. Tidy selection provides a domain-specific language to select columns by name, position, or type.

26.2.1 Indirection

Indirection with tidy selection occurs when a column selection is stored in an intermediate variable. There are two main cases:

  1. Data-variable in a function argument: Need to embrace the argument with curly-curly ({{).
summarise_mean <- function(data, vars) {
  data %>% summarise(n = n(), across({{ vars }}, mean))
}
mtcars %>% 
  group_by(cyl) %>% 
  summarise_mean(where(is.numeric))
#> # A tibble: 3 × 12
#>     cyl     n   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4    11  26.7  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
#> 2     6     7  19.7  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
#> 3     8    14  15.1  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5
  2. Environment-variable that is a character vector: Need to use all_of() or any_of() depending on whether you want the function to error if a variable is not found.
vars <- c("mpg", "vs")
mtcars %>% select(all_of(vars)) %>% head()
#>                    mpg vs
#> Mazda RX4         21.0  0
#> Mazda RX4 Wag     21.0  0
#> Datsun 710        22.8  1
#> Hornet 4 Drive    21.4  1
#> Hornet Sportabout 18.7  0
#> Valiant           18.1  1
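
To illustrate the difference between the two helpers, the sketch below adds a name that does not exist in mtcars to the character vector. The name "not_a_column" is made up for the example.

```r
library(dplyr)

vars <- c("mpg", "vs", "not_a_column")

# any_of() ignores names that are not present in the data.
lenient <- mtcars %>% select(any_of(vars))
names(lenient)

# all_of() is strict: a missing name raises an error, caught here
# so that the script keeps running.
strict <- tryCatch(
  mtcars %>% select(all_of(vars)),
  error = function(e) "error"
)
```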

26.3 How-tos

26.3.1 Eliminating R CMD check NOTEs

If you have a function that uses data masking or tidy selection, the variables used within the function will lead to a NOTE from R CMD check about undefined global variables. There are two ways to eliminate this NOTE, depending on whether it derives from data masking or tidy selection.

  1. Data masking: use .data$var and import .data from its source in the rlang package.
  2. Tidy selection: use "var" instead of var.
#' @importFrom rlang .data
my_summary_function <- function(data) {
  data %>% 
    select("grp", "x", "y") %>% 
    filter(.data$x > 0) %>% 
    group_by(.data$grp) %>% 
    summarise(y = mean(.data$y), n = n())
}
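
As a sketch of what the conversion looks like, the two hypothetical functions below do the same work. `summary_bare()` uses bare column names, the style that triggers the NOTE inside a package, while `summary_checked()` applies both fixes. The `toy` tibble is invented for the comparison.

```r
library(dplyr)

# Bare names: fine interactively, but flagged by R CMD check in a package.
summary_bare <- function(data) {
  data %>%
    select(grp, x, y) %>%
    filter(x > 0) %>%
    group_by(grp) %>%
    summarise(y = mean(y), n = n())
}

# Strings for tidy selection, .data$ for data masking.
summary_checked <- function(data) {
  data %>%
    select("grp", "x", "y") %>%
    filter(.data$x > 0) %>%
    group_by(.data$grp) %>%
    summarise(y = mean(.data$y), n = n())
}

toy <- tibble(
  grp = c("a", "a", "b", "b"),
  x   = c(1, -1, 2, 3),
  y   = c(10, 20, 30, 50)
)
res_bare    <- summary_bare(toy)
res_checked <- summary_checked(toy)
```

The two results should match, so the rewrite changes only how the names are resolved, not what the function computes.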

26.3.2 User-supplied expressions in function arguments

Use embracing to capture and inject the expression into the function.

my_summarise <- function(data, expr) {
  data %>% summarise(
    mean = mean({{ expr }}),
    sum = sum({{ expr }}),
    n = n()
  )
}
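
A short usage sketch, assuming mtcars as example data: because the argument is embraced, the caller can pass a computed expression as well as a plain column name.

```r
library(dplyr)

my_summarise <- function(data, expr) {
  data %>% summarise(
    mean = mean({{ expr }}),
    sum = sum({{ expr }}),
    n = n()
  )
}

# `hp / wt` is captured as an expression and evaluated within each group.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  my_summarise(hp / wt)
```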

To use the name of the variable in the output, embrace the variable with {{ inside a glue string on the left-hand side of :=.

my_summarise2 <- function(data, mean_var, sd_var) {
  data %>% 
    summarise(
      "mean_{{mean_var}}" := mean({{ mean_var }}), 
      "sd_{{sd_var}}" := sd({{ sd_var }})
    )
}
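
A brief usage sketch, again assuming mtcars as example data, shows how the output column names follow the arguments supplied by the caller.

```r
library(dplyr)

my_summarise2 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(
      "mean_{{mean_var}}" := mean({{ mean_var }}),
      "sd_{{sd_var}}" := sd({{ sd_var }})
    )
}

out <- mtcars %>% my_summarise2(mpg, disp)
names(out)  # names are derived from the arguments: mean_mpg and sd_disp
```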

26.3.3 Any number of user-supplied expressions

Use ... to take any number of user-specified expressions. When using ..., all named arguments should begin with . to minimize the chance of argument clashes. See the tidyverse design guide for details.

my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE),
              height = mean(height, na.rm = TRUE))
}
starwars %>% my_summarise(homeworld, gender)
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 57 × 4
#> # Groups:   homeworld [49]
#>    homeworld      gender     mass height
#>    <chr>          <chr>     <dbl>  <dbl>
#>  1 Alderaan       feminine     49   150 
#>  2 Alderaan       masculine    79   190.
#>  3 Aleen Minor    masculine    15    79 
#>  4 Bespin         masculine    79   175 
#>  5 Bestine IV     <NA>        110   180 
#>  6 Cato Neimoidia masculine    90   191 
#>  7 Cerea          masculine    82   198 
#>  8 Champala       masculine   NaN   196 
#>  9 Chandrila      feminine    NaN   150 
#> 10 Concord Dawn   masculine    79   183 
#> # ℹ 47 more rows

26.3.4 Transforming user-supplied variables

Use across() and pick() (the latter new in dplyr 1.1.0) to transform sets of data variables. You can also use the .names argument of across() to control the names of the output columns.

my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(pick({{ group_var }})) %>% 
    summarise(across({{ summarise_var }},
                     ~ mean(., na.rm = TRUE),
                     .names = "mean_{.col}"))
}
my_summarise(starwars, 
             group_var = c(species, gender),
             summarise_var = c(mass, height))
#> `summarise()` has grouped output by 'species'. You can override using the
#> `.groups` argument.
#> # A tibble: 42 × 4
#> # Groups:   species [38]
#>    species   gender    mean_mass mean_height
#>    <chr>     <chr>         <dbl>       <dbl>
#>  1 Aleena    masculine      15            79
#>  2 Besalisk  masculine     102           198
#>  3 Cerean    masculine      82           198
#>  4 Chagrian  masculine     NaN           196
#>  5 Clawdite  feminine       55           168
#>  6 Droid     feminine      NaN            96
#>  7 Droid     masculine      69.8         140
#>  8 Dug       masculine      40           112
#>  9 Ewok      masculine      20            88
#> 10 Geonosian masculine      80           183
#> # ℹ 32 more rows