5 dplyr 1.0.0
dplyr 1.0.0 was released on 1 June 2020.
5.1 dplyr 1.0.0 Blog posts
5.2 Overview of new features
- Better support for row-wise operations: Row-wise operations.
- A new, simpler approach to column-wise operations: Column-wise operations.
- select() can select columns based on their type, and has a new syntax that better matches how you describe selections in English.
- A new relocate() verb makes it easier to change the position of columns.
- A new way to program with dplyr. See Programming with dplyr notes.
- dplyr is now based on the vctrs package.
5.3 New summarise() features
summarise() can now compute multiple summaries per group and return multiple rows per group. Returning more than one row per group was later deprecated in dplyr 1.1.0 in favor of the new function reframe(). See reframe().
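As a minimal sketch of the replacement, assuming dplyr 1.1.0 or later is installed, reframe() can return any number of rows per group. The data frame df_vals and its columns g and x are made up for illustration.

```r
library(dplyr)

# Hypothetical data: two groups with three values each
df_vals <- tibble(
  g = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 10, 20, 30)
)

# range() returns two values, so each group contributes two rows
ranges <- df_vals %>%
  group_by(g) %>%
  reframe(x_range = range(x))

ranges
```

Unlike summarise(), reframe() always returns an ungrouped data frame, so there is no .groups argument to set.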
summarise() also gains a new .groups argument to control how groups are dropped when summarise() is used on a grouped data frame. See summarise() and grouping. The options are:
- "drop_last": (default when every group returns one row) drops the last grouping level.
- "drop": drops all grouping levels.
- "keep": preserves the grouping of the input.
- "rowwise": turns each row into its own group.
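The four options can be compared side by side. This is a small sketch on made-up data (the tibble sales and its columns g1, g2, and val are assumptions for illustration), checking the grouping of each result with group_vars().

```r
library(dplyr)

# Hypothetical data with two grouping columns
sales <- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "y", "x", "y"),
  val = 1:4
)
grouped <- sales %>% group_by(g1, g2)

drop_last <- grouped %>% summarise(total = sum(val), .groups = "drop_last")
dropped   <- grouped %>% summarise(total = sum(val), .groups = "drop")
kept      <- grouped %>% summarise(total = sum(val), .groups = "keep")
by_row    <- grouped %>% summarise(total = sum(val), .groups = "rowwise")

group_vars(drop_last)  # only g1 remains as a grouping variable
group_vars(dropped)    # no grouping variables remain
group_vars(kept)       # both g1 and g2 remain
class(by_row)          # includes "rowwise_df"
```

Setting .groups explicitly also silences the message summarise() prints about the grouping of its output.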
5.4 select(), rename(), relocate()
These features are implemented in the tidyselect package. See dplyr Argument type: tidy-select.
5.4.1 Five ways to select variables in select() and rename():
- Position: df %>% select(1:4)
  - Generally not recommended, but it can be very useful, particularly if the variable names are very long, non-syntactic, or duplicated.
- Name: df %>% select(a, e, j)
- Function of name: df %>% select(starts_with("x"))
  - Helper functions: starts_with(), ends_with(), contains(), matches()
- Type: df %>% select(where(is.numeric))
- Any combination with Boolean operators !, &, and |: df %>% select(!where(is.factor))
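The five approaches can be tried on one small tibble. This sketch uses a one-row data frame named demo (a hypothetical name) with a mix of numeric and character columns, and inspects only the resulting column names.

```r
library(dplyr)

# Hypothetical one-row tibble mixing numeric and character columns
demo <- tibble(x1 = 1, x2 = "a", x3 = 2, y1 = "b", y2 = 3, y3 = "c", y4 = 4)

by_position <- demo %>% select(1:2)                  # first two columns
by_name     <- demo %>% select(x1, y4)               # bare column names
by_helper   <- demo %>% select(starts_with("y"))     # function of the name
by_type     <- demo %>% select(where(is.numeric))    # predicate on column type
combined    <- demo %>% select(starts_with("x") & !where(is.numeric))

names(by_position)
names(by_name)
names(by_helper)
names(by_type)
names(combined)
```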
5.4.2 Programming
- any_of(): takes a character vector of variable names and silently ignores the missing columns.
- all_of(): throws an error if a column name is missing.
df <- tibble(x1 = 1, x2 = "a", x3 = 2, y1 = "b", y2 = 3, y3 = "c", y4 = 4)
vars <- c("x1", "x2", "y1", "z")
df %>% select(any_of(vars))
#> # A tibble: 1 × 3
#> x1 x2 y1
#> <dbl> <chr> <chr>
#> 1 1 a b
# all_of() errors if variable is missing
df %>% select(all_of(vars))
#> Error in `all_of()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `z` doesn't exist.
rename_with() makes it easier to rename variables programmatically. It supersedes rename_if() and rename_at().
df %>% rename_with(toupper)
#> # A tibble: 1 × 7
#> X1 X2 X3 Y1 Y2 Y3 Y4
#> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 a 2 b 3 c 4
You can optionally choose which columns to apply the transformation to:
df %>% rename_with(toupper, starts_with("x"))
#> # A tibble: 1 × 7
#> X1 X2 X3 y1 y2 y3 y4
#> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 a 2 b 3 c 4
df %>% rename_with(toupper, where(is.numeric))
#> # A tibble: 1 × 7
#> X1 x2 X3 y1 Y2 y3 Y4
#> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 a 2 b 3 c 4
5.4.3 relocate()
relocate() is a specialized function for moving columns around. By default, it moves the selected columns to the front (the left-hand side) of the data frame.
If you want to move columns to a different position, use .before or .after.
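A short sketch of the default behavior and of .before and .after, using a hypothetical one-row tibble named cols_demo:

```r
library(dplyr)

# Hypothetical tibble; only the column order matters here
cols_demo <- tibble(x1 = 1, x2 = "a", x3 = 2, y1 = "b", y2 = 3, y3 = "c", y4 = 4)

to_front  <- cols_demo %>% relocate(y1, y2)                 # default: move to the front
after_x1  <- cols_demo %>% relocate(y1, .after = x1)        # place y1 right after x1
chr_first <- cols_demo %>% relocate(where(is.character), .before = x1)

names(to_front)
names(after_x1)
names(chr_first)
```

Columns that are not selected keep their relative order.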
Use last_col() to move columns to the right-hand side.
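For instance, combining .after with last_col() sends a column to the end. The tibble end_demo is a made-up example.

```r
library(dplyr)

# Hypothetical tibble; only the column order matters here
end_demo <- tibble(x1 = 1, x2 = "a", x3 = 2, y1 = "b")

# Move x1 after whatever column is currently last
moved <- end_demo %>% relocate(x1, .after = last_col())

names(moved)
```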
5.5 Working across columns
See the Column-wise operations vignette and the notes on the vignette.
across() replaces and supersedes the _if(), _at(), and _all() suffixed variants of summarise() and mutate().
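To illustrate the correspondence, the sketch below computes the same summary with the superseded summarise_if() and with across(). The tibble scores and its columns are assumptions made up for this example.

```r
library(dplyr)

# Hypothetical data: one character column and two numeric columns
scores <- tibble(id = c("a", "b", "c"), h = c(1, 2, 3), w = c(10, 20, 30))

# Superseded scoped variant
old_style <- scores %>% summarise_if(is.numeric, mean)

# Equivalent call using across() with a tidyselect predicate
new_style <- scores %>% summarise(across(where(is.numeric), mean))

old_style
new_style
```

The scoped variants still work, but across() lets a single summarise() or mutate() call mix several selections and ordinary expressions.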
5.5.1 Basic usage
across() has two main arguments:
- .cols: selects the columns you want to operate on using tidy select syntax.
- .fns: a function or list of functions to apply to each column.
across() uses tidyselect, so it supports helper functions such as where() and starts_with(), and you can use c() to select multiple columns instead of the older vars().
starwars %>%
  summarise(across(where(is.character), n_distinct))
#> # A tibble: 1 × 8
#> name hair_color skin_color eye_color sex gender homeworld species
#> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 87 12 31 15 5 3 49 38
starwars %>%
  summarise(across(c(sex, gender, homeworld), n_distinct))
#> # A tibble: 1 × 3
#> sex gender homeworld
#> <int> <int> <int>
#> 1 5 3 49
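The .fns argument also accepts a named list of functions. A minimal sketch, using a made-up tibble named measurements, shows how the output columns are named by default (column name, underscore, function name):

```r
library(dplyr)

# Hypothetical numeric data
measurements <- tibble(a = c(1, 2, 3), b = c(4, 6, 8))

# Apply two functions to each selected column
summary_tbl <- measurements %>%
  summarise(across(c(a, b), list(min = min, max = max)))

names(summary_tbl)
summary_tbl
```

The .names argument of across() can be used to change this naming pattern.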
You can also apply a function with extra arguments. As of dplyr 1.1.0, passing those arguments through ... is deprecated, so you need to wrap the call in an anonymous function or lambda instead.
starwars %>%
  group_by(homeworld) %>%
  filter(n() > 1) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)), n = n())
#> # A tibble: 10 × 5
#> homeworld height mass birth_year n
#> <chr> <dbl> <dbl> <dbl> <int>
#> 1 Alderaan 176. 64 43 3
#> 2 Corellia 175 78.5 25 2
#> 3 Coruscant 174. 50 91 3
#> 4 Kamino 208. 83.1 31.5 3
#> 5 Kashyyyk 231 124 200 2
#> 6 Mirial 168 53.1 49 2
#> 7 Naboo 177. 64.2 55 11
#> 8 Ryloth 179 55 48 2
#> 9 Tatooine 170. 85.4 54.6 10
#> 10 <NA> 139. 82 334. 10
5.6 Working within rows
See Row-wise operations vignette and notes on the vignette.
rowwise() works like group_by() in the sense that it doesn’t change what the data looks like; it changes how dplyr verbs operate on the data.
For example, suppose you want to calculate the mean of each student’s test scores:
df <- tibble(
  student_id = 1:4,
  test1 = 10:13,
  test2 = 20:23,
  test3 = 30:33,
  test4 = 40:43
)
# mutate() does not do what we want
df %>% mutate(avg = mean(c(test1, test2, test3, test4)))
#> # A tibble: 4 × 6
#> student_id test1 test2 test3 test4 avg
#> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 20 30 40 26.5
#> 2 2 11 21 31 41 26.5
#> 3 3 12 22 32 42 26.5
#> 4 4 13 23 33 43 26.5
# change with rowwise
df %>%
  rowwise() %>%
  mutate(avg = mean(c(test1, test2, test3, test4)))
#> # A tibble: 4 × 6
#> # Rowwise:
#> student_id test1 test2 test3 test4 avg
#> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 20 30 40 25
#> 2 2 11 21 31 41 26
#> 3 3 12 22 32 42 27
#> 4 4 13 23 33 43 28
You can also pair rowwise() with c_across() to select columns using tidyselect helpers. c_across() combines the selected values with vctrs::vec_c() rather than c(), which gives safer outputs.
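As a sketch, the student average from the example above can be written with c_across() so the test columns do not have to be listed by name. The tibble is rebuilt here under the hypothetical name students to keep the block self-contained.

```r
library(dplyr)

# Same shape as the test-score example: one row per student
students <- tibble(
  student_id = 1:4,
  test1 = 10:13,
  test2 = 20:23,
  test3 = 30:33,
  test4 = 40:43
)

# c_across() gathers the selected columns for the current row
averages <- students %>%
  rowwise() %>%
  mutate(avg = mean(c_across(starts_with("test")))) %>%
  ungroup()

averages$avg
```

Calling ungroup() at the end removes the row-wise grouping so later verbs operate on whole columns again.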