3  dplyr: Column-wise operations

Published

March 13, 2023

Modified

January 8, 2024

This vignette centers on the use of across(), which was introduced in 2020 with dplyr 1.0.0. See dplyr 1.0.0 notes.

3.1 Basic usage

across() has two main arguments:

  1. .cols: selects the columns you want to operate on using tidy select syntax.
  2. .fns: a function or list of functions to apply to each column.

across() is built on tidyselect, so .cols accepts helpers such as where() and starts_with(), as well as c() to select several columns by name. This replaces the older vars() helper.

starwars %>% 
  summarise(across(where(is.character), n_distinct))
#> # A tibble: 1 × 8
#>    name hair_color skin_color eye_color   sex gender homeworld species
#>   <int>      <int>      <int>     <int> <int>  <int>     <int>   <int>
#> 1    87         12         31        15     5      3        49      38

starwars %>% 
  summarise(across(c(sex, gender, homeworld), n_distinct))
#> # A tibble: 1 × 3
#>     sex gender homeworld
#>   <int>  <int>     <int>
#> 1     5      3        49
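The same selection helpers can be combined with & and !. The following is a minimal sketch on a made-up tibble (scores and its columns are hypothetical, not part of starwars) showing selection by name prefix and selection by type with one column excluded.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
scores <- tibble(
  id       = c("a", "b", "c"),
  test_one = c(10, 20, 30),
  test_two = c(1, 2, 3),
  note     = c("x", "y", "x")
)

# Select columns by name prefix
by_prefix <- scores %>%
  summarise(across(starts_with("test"), sum))
# test_one = 60, test_two = 6

# Select character columns, but exclude id
by_type <- scores %>%
  summarise(across(where(is.character) & !id, n_distinct))
# note = 2
```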

You can also apply a function with extra arguments. As of dplyr 1.1.0, passing those arguments through the ... of across() is deprecated, so wrap the call in an anonymous function or a purrr-style lambda instead.

starwars %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)), n = n())
#> # A tibble: 10 × 5
#>    homeworld height  mass birth_year     n
#>    <chr>      <dbl> <dbl>      <dbl> <int>
#>  1 Alderaan    176.  64         43       3
#>  2 Corellia    175   78.5       25       2
#>  3 Coruscant   174.  50         91       3
#>  4 Kamino      208.  83.1       31.5     3
#>  5 Kashyyyk    231  124        200       2
#>  6 Mirial      168   53.1       49       2
#>  7 Naboo       177.  64.2       55      11
#>  8 Ryloth      179   55         48       2
#>  9 Tatooine    170.  85.4       54.6    10
#> 10 <NA>        139.  82        334.     10
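The function can be written in several equivalent ways. The sketch below uses a small made-up tibble (vals is hypothetical) and assumes R 4.1 or later for the \(v) shorthand. It compares the base R anonymous function, the purrr-style lambda, and a named helper.

```r
library(dplyr)

# Hypothetical toy data with one missing value
vals <- tibble(
  g = c("a", "a", "b"),
  x = c(1, NA, 3),
  y = c(2, 4, 6)
)

# Base R anonymous function (shorthand needs R >= 4.1)
base_lambda <- vals %>%
  group_by(g) %>%
  summarise(across(c(x, y), \(v) mean(v, na.rm = TRUE)))

# purrr-style lambda, where .x stands for the current column
purrr_lambda <- vals %>%
  group_by(g) %>%
  summarise(across(c(x, y), ~ mean(.x, na.rm = TRUE)))

# Named helper defined up front (mean_no_na is a made-up name)
mean_no_na <- function(v) mean(v, na.rm = TRUE)
named_fn <- vals %>%
  group_by(g) %>%
  summarise(across(c(x, y), mean_no_na))
```

All three calls should return the same two-row summary: x is 1 and 3, and y is 3 and 6, for groups a and b.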

3.2 Multiple functions

You can transform each variable with more than one function by supplying a named list of functions (named functions, anonymous functions, or purrr-style lambdas) as the second argument.

min_max <- list(
  min = \(x) min(x, na.rm = TRUE),
  max = \(x) max(x, na.rm = TRUE)
)
starwars %>% 
  summarise(across(where(is.numeric), min_max))
#> # A tibble: 1 × 6
#>   height_min height_max mass_min mass_max birth_year_min birth_year_max
#>        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
#> 1         66        264       15     1358              8            896

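You can control the names of the output columns with the .names argument, which takes a glue specification: {.col} stands for the column name and {.fn} for the function name. The default is "{.col}" for a single function and "{.col}_{.fn}" for a list of functions.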

starwars %>% 
  summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}"))
#> # A tibble: 1 × 6
#>   min.height max.height min.mass max.mass min.birth_year max.birth_year
#>        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
#> 1         66        264       15     1358              8            896
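.names is also useful with a single function inside mutate(), where the default would overwrite the input columns. A minimal sketch on a made-up tibble (m and the "_scaled" suffix are illustrative choices):

```r
library(dplyr)

# Hypothetical toy data
m <- tibble(
  height = c(150, 160, 170),
  mass   = c(50, 60, 70)
)

# Keep the originals and add rescaled copies next to them
scaled <- m %>%
  mutate(across(everything(), \(x) x / max(x), .names = "{.col}_scaled"))

names(scaled)
# "height" "mass" "height_scaled" "mass_scaled"
```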

3.3 Gotchas

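Be careful when combining other numeric summaries with across(where(is.numeric)). For instance, if you compute n = n(), make sure it comes after the across() call. Otherwise n already exists as an integer column when across() runs, so where(is.numeric) selects it and the function is applied to the count as well.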

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

df %>% 
  summarise(across(where(is.numeric), sd), n = n())
#>   x        y n
#> 1 1 4.041452 3
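To see what goes wrong, the sketch below reverses the order on the same df, then shows one possible fix that keeps n first by excluding it from the selection.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

# n is created first, so where(is.numeric) also selects it;
# sd() of a single value is NA, which overwrites the count
wrong <- df %>%
  summarise(n = n(), across(where(is.numeric), sd))
wrong$n
# NA

# Excluding n from the selection keeps the count intact
fixed <- df %>%
  summarise(n = n(), across(where(is.numeric) & !n, sd))
fixed$n
# 3
```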

Another, more explicit option is to call tibble() inside summarise() and build the result from the individual pieces.

df %>% 
  summarise(
    tibble(n = n(), across(where(is.numeric), sd))
  )
#>   n x        y
#> 1 3 1 4.041452

3.4 filter() and across()

across() cannot be used directly inside filter(): it returns one logical column per selected column, and filter() needs an extra step to combine them into a single condition per row. Use if_any() and if_all() for this.

  • if_any() keeps the rows where the predicate is true for at least one selected column.
  • if_all() keeps the rows where the predicate is true for all selected columns.

nrow(starwars)
#> [1] 87

# Keep rows with at least one non-NA value
starwars %>% 
  filter(if_any(everything(), ~ !is.na(.x))) %>% 
  nrow()
#> [1] 87

# Keep rows that do not have any NA values
starwars %>% 
  filter(if_all(everything(), ~ !is.na(.x))) %>% 
  nrow()
#> [1] 29
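The predicate does not have to be about missing values. The following sketch uses a made-up tibble (readings is hypothetical) and a threshold test on two chosen columns. Note that a row where the combined result is NA is dropped by filter().

```r
library(dplyr)

# Hypothetical toy data
readings <- tibble(
  id = 1:4,
  a  = c(1, 5, NA, 8),
  b  = c(2, 1, 3, 9)
)

# Rows where a OR b exceeds 4; row 3 gives NA | FALSE = NA and is dropped
any_big <- readings %>%
  filter(if_any(c(a, b), \(v) v > 4))
any_big$id
# 2 4

# Rows where a AND b both exceed 4
all_big <- readings %>%
  filter(if_all(c(a, b), \(v) v > 4))
all_big$id
# 4
```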

3.5 Replacing _if, _at, and _all

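across() supersedes the scoped verb variants with the _if, _at, and _all suffixes. It has several advantages over them:

  1. across() makes it possible to compute useful summaries that were previously impossible. For example, it’s now easy to summarise numeric vectors with one function, factors with another, and still compute the number of rows in each group.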
  2. across() reduces the number of functions that dplyr needs to provide.
  3. With the where() helper, across() unifies _if and _at semantics, allowing combinations that used to be impossible. For example, you can now transform all numeric columns whose name begins with “x”: across(where(is.numeric) & starts_with("x")).
  4. across() doesn’t need vars().
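Points 1 and 3 can be sketched on a small made-up tibble (d and its columns are hypothetical). The first call combines a type test with a name test. The second applies different functions to numeric and factor columns and still counts the rows per group.

```r
library(dplyr)

# Hypothetical toy data
d <- tibble(
  grp = factor(c("a", "a", "b")),
  x1  = c(1, 2, 3),
  x2  = c(10, 20, 30),
  y   = c(0.5, 1.5, 2.5),
  lab = factor(c("u", "v", "u"))
)

# Numeric columns whose name starts with "x": x1 and x2, but not y
doubled <- d %>%
  mutate(across(where(is.numeric) & starts_with("x"), \(v) v * 2))

# Mean for numeric columns, distinct count for factor columns, plus n.
# The numeric across() comes first so it does not pick up the
# integer results of the later steps.
per_group <- d %>%
  group_by(grp) %>%
  summarise(
    across(where(is.numeric), mean),
    across(where(is.factor), n_distinct),
    n = n()
  )
```

In per_group, the grouping column grp is left out of both selections automatically, because across() does not operate on grouping variables.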