3 dplyr: Column-wise operations
This vignette centers on the use of across(), which was introduced in 2020 with dplyr 1.0.0. See dplyr 1.0.0 notes.
3.1 Basic usage
across() has two main arguments:
- .cols: selects the columns you want to operate on using tidy select syntax.
- .fns: a function or list of functions to apply to each column.
across() is built on tidyselect, so you can select columns with helpers such as where() and starts_with(), or with c() to name several columns, instead of the older vars().
starwars %>%
summarise(across(where(is.character), n_distinct))
#> # A tibble: 1 × 8
#> name hair_color skin_color eye_color sex gender homeworld species
#> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 87 12 31 15 5 3 49 38
starwars %>%
summarise(across(c(sex, gender, homeworld), n_distinct))
#> # A tibble: 1 × 3
#> sex gender homeworld
#> <int> <int> <int>
#> 1 5 3 49
You can also apply a function with extra arguments. As of dplyr 1.1.0, passing those arguments through the ... of across() is deprecated, so wrap the call in an anonymous function or a lambda instead.
starwars %>%
group_by(homeworld) %>%
filter(n() > 1) %>%
summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)), n = n())
#> # A tibble: 10 × 5
#> homeworld height mass birth_year n
#> <chr> <dbl> <dbl> <dbl> <int>
#> 1 Alderaan 176. 64 43 3
#> 2 Corellia 175 78.5 25 2
#> 3 Coruscant 174. 50 91 3
#> 4 Kamino 208. 83.1 31.5 3
#> 5 Kashyyyk 231 124 200 2
#> 6 Mirial 168 53.1 49 2
#> 7 Naboo 177. 64.2 55 11
#> 8 Ryloth 179 55 48 2
#> 9 Tatooine 170. 85.4 54.6 10
#> 10 <NA> 139. 82 334. 10
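As a small sketch of the point above, the three forms below should be interchangeable ways of passing na.rm = TRUE. The choice of height and mass is only for illustration, and the \(x) shorthand assumes R 4.1 or later.

```r
library(dplyr)

# R 4.1+ shorthand for an anonymous function
by_backslash <- starwars %>%
  summarise(across(c(height, mass), \(x) mean(x, na.rm = TRUE)))

# Classic anonymous function
by_function <- starwars %>%
  summarise(across(c(height, mass), function(x) mean(x, na.rm = TRUE)))

# purrr-style formula lambda, where .x stands for the current column
by_formula <- starwars %>%
  summarise(across(c(height, mass), ~ mean(.x, na.rm = TRUE)))

# Compare the results of the three forms
all.equal(by_backslash, by_function)
all.equal(by_function, by_formula)
```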
3.2 Multiple functions
You can transform each variable with more than one function by supplying a named list of functions (ordinary functions, lambdas, or anonymous functions) as the second argument, .fns.
min_max <- list(
min = \(x) min(x, na.rm = TRUE),
max = \(x) max(x, na.rm = TRUE)
)
starwars %>%
summarise(across(where(is.numeric), min_max))
#> # A tibble: 1 × 6
#> height_min height_max mass_min mass_max birth_year_min birth_year_max
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 66 264 15 1358 8 896
You can control the names created for the columns with the .names argument and glue style syntax.
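A minimal sketch of .names, reusing the min_max list from this section (redefined here so the block runs on its own). The particular pattern "{.fn}.{.col}" is just one possible choice.

```r
library(dplyr)

min_max <- list(
  min = \(x) min(x, na.rm = TRUE),
  max = \(x) max(x, na.rm = TRUE)
)

# {.col} is the original column name, {.fn} is the name of the function in the list
renamed <- starwars %>%
  summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}"))

names(renamed)
#> [1] "min.height"     "max.height"     "min.mass"       "max.mass"
#> [5] "min.birth_year" "max.birth_year"
```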
3.3 Gotchas
Be careful when combining numeric summaries with where(is.numeric). For instance, if using n = n(), make sure that it comes after across(where(is.numeric), ...). Otherwise n is itself a numeric column by the time across() runs, and it gets transformed along with the others.
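The ordering issue can be reproduced on a toy data frame. The df below is made up for illustration.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

# n is created first, so where(is.numeric) also selects it;
# sd() of the single value 3 is NA
wrong <- df %>%
  summarise(n = n(), across(where(is.numeric), sd))

# across() runs first, so n is left untouched
right <- df %>%
  summarise(across(where(is.numeric), sd), n = n())

wrong$n
#> [1] NA
right$n
#> [1] 3
```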
Another way to do this, and to make it more explicit, is to call tibble() within summarise() to create a new tibble from the different pieces.
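A sketch of the tibble() approach on a made-up data frame. Because n is built inside the same tibble() call, it is not yet a column of the data when across() selects numeric columns, so it can safely come first.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

# n() and across() are combined in one expression that returns a tibble
both <- df %>%
  summarise(
    tibble(n = n(), across(where(is.numeric), sd))
  )

both$n
#> [1] 3
```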
3.4 filter() and across()
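You cannot use across() directly in filter(), because the per-column results need an extra step to be combined into a single logical condition. if_any() and if_all() do this: if_any() keeps rows where the predicate is true for at least one selected column, and if_all() keeps rows where it is true for every selected column.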
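A minimal sketch of both helpers on a made-up tibble. The column names id, a, and b are hypothetical.

```r
library(dplyr)

df <- tibble(
  id = 1:4,
  a  = c(1, NA, 3, NA),
  b  = c(NA, NA, 6, 8)
)

# Keep rows where at least one of a, b is not missing
any_present <- df %>% filter(if_any(c(a, b), ~ !is.na(.x)))

# Keep rows where both a and b are not missing
all_present <- df %>% filter(if_all(c(a, b), ~ !is.na(.x)))

any_present$id
#> [1] 1 3 4
all_present$id
#> [1] 3
```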
3.5 Replacing _if, _at, and _all
- across() makes it possible to compute useful summaries that were previously impossible. For example, it’s now easy to summarise numeric vectors with one function, factors with another, and still compute the number of rows in each group.
- across() reduces the number of functions that dplyr needs to provide.
- With the where() helper, across() unifies _if and _at semantics, allowing combinations that used to be impossible. For example, you can now transform all numeric columns whose name begins with “x”: across(where(is.numeric) & starts_with("x")).
- across() doesn’t need vars().
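Two of the points above can be sketched on a made-up tibble: combining a type test with a name test in one selection, and summarising different column types with different functions while still counting rows. All column names here are hypothetical.

```r
library(dplyr)

df <- tibble(
  g  = factor(c("a", "a", "b")),
  x1 = c(1, 2, 3),
  x2 = c("u", "v", "w"),
  y  = c(10, 20, 30)
)

# _if and _at semantics combined: only numeric columns whose name starts with "x"
doubled <- df %>%
  mutate(across(where(is.numeric) & starts_with("x"), \(v) v * 2))

# One function for numeric columns, another for factors, plus a row count;
# n = n() comes last so it is not picked up by where(is.numeric)
summary_tbl <- df %>%
  summarise(
    across(where(is.numeric), mean),
    across(where(is.factor), nlevels),
    n = n()
  )
```

In doubled, only x1 changes: x2 starts with "x" but is not numeric, and y is numeric but does not start with "x".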