5  dplyr 1.0.0

Published

June 3, 2020

Modified

January 8, 2024

dplyr 1.0.0 was released on 1 June 2020.

5.1 dplyr 1.0.0 Blog posts

  1. dplyr 1.0.0 is coming soon
  2. dplyr 1.0.0: new summarise() features
  3. dplyr 1.0.0: select, rename, relocate
  4. dplyr 1.0.0: working across columns
  5. dplyr 1.0.0: working within rows
  6. dplyr 1.0.0 and vctrs
  7. dplyr 1.0.0 for package developers
  8. dplyr 1.0.0: last minute additions

5.2 Overview of new features

5.3 New summarise() features

summarise() can now return multiple summaries per group, producing more than one row per group. This ability was deprecated in dplyr 1.1.0 and moved to the new function reframe(). See reframe().
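As a minimal sketch of the multi-row case, assuming dplyr 1.1.0 or later is installed, reframe() can return several rows per group. The tibble `scores` and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: two groups with three values each
scores <- tibble(
  g = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 10, 20, 30)
)

# range() returns two values, so each group contributes two rows
ranges <- scores %>%
  group_by(g) %>%
  reframe(stat = c("min", "max"), value = range(x))

ranges
```

Unlike summarise(), reframe() always returns an ungrouped data frame, so there is no .groups argument to set.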

summarise() also gains a new .groups argument that controls the grouping of the result when summarise() is used on a grouped data frame. See summarise() and grouping. The options are:

  • "drop_last": (default when every group returns a single row) drops the last grouping level.
  • "drop": drops all grouping levels.
  • "keep": preserves the grouping of the input.
  • "rowwise": turns each row into its own group.
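The effect of each option can be checked by inspecting the grouping of the result. This is a small sketch using the starwars data that ships with dplyr; the object names (`by_two`, `dropped`, and so on) are only illustrative.

```r
library(dplyr)

by_two <- starwars %>% group_by(sex, gender)

drop_last <- summarise(by_two, n = n(), .groups = "drop_last")
dropped   <- summarise(by_two, n = n(), .groups = "drop")
kept      <- summarise(by_two, n = n(), .groups = "keep")
by_row    <- summarise(by_two, n = n(), .groups = "rowwise")

group_vars(drop_last)  # only the first grouping variable remains
group_vars(dropped)    # no grouping variables
group_vars(kept)       # both grouping variables
class(by_row)[1]       # the result is a rowwise data frame
```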

5.4 select(), rename(), relocate()

These features are implemented in the tidyselect package. See dplyr Argument type: tidy-select.

5.4.1 Five ways to select variables in select() and rename():

  1. Position: df %>% select(1:4)
    • Generally not recommended, but it can be very useful, particularly if the variable names are very long, non-syntactic, or duplicated.
  2. Name: df %>% select(a, e, j)
  3. Function of name: df %>% select(starts_with("x"))
  4. Type: df %>% select(where(is.numeric))
  5. Any combination with Boolean operators !, &, and |: df %>% select(!where(is.factor))
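The Boolean operators can combine any of the other selection styles. The following sketch uses a made-up four-column tibble to show intersection, union, and negation.

```r
library(dplyr)

# Hypothetical data with mixed column types
df <- tibble(x1 = 1, x2 = "a", y1 = 2, y2 = factor("b"))

# Intersection: columns that start with "x" AND are numeric
both <- df %>% select(starts_with("x") & where(is.numeric))

# Union: columns that start with "y" OR are character
either <- df %>% select(starts_with("y") | where(is.character))

# Negation: every column that is not a factor
not_factor <- df %>% select(!where(is.factor))

names(both)
names(either)
names(not_factor)
```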

5.4.2 Programming

  • any_of() takes a character vector of variable names and silently ignores any that are missing.
  • all_of() takes the same kind of vector but throws an error if any column name is missing.
df <- tibble(x1 = 1, x2 = "a", x3 = 2, y1 = "b", y2 = 3, y3 = "c", y4 = 4)

vars <- c("x1", "x2", "y1", "z")
df %>% select(any_of(vars))
#> # A tibble: 1 × 3
#>      x1 x2    y1   
#>   <dbl> <chr> <chr>
#> 1     1 a     b

# all_of() errors if variable is missing
df %>% select(all_of(vars))
#> Error in `all_of()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `z` doesn't exist.
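Because any_of() ignores missing names, it also works for dropping columns that may or may not be present. This sketch repeats the `df` and `vars` definitions from above so that it runs on its own; the name `remaining` is illustrative.

```r
library(dplyr)

df <- tibble(x1 = 1, x2 = "a", x3 = 2, y1 = "b", y2 = 3, y3 = "c", y4 = 4)
vars <- c("x1", "x2", "y1", "z")

# Drop whichever of these columns exist; the missing "z" is ignored
remaining <- df %>% select(!any_of(vars))

names(remaining)
```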

rename_with() makes it easier to rename variables programmatically. It supersedes rename_if() and rename_at().

df %>% rename_with(toupper)
#> # A tibble: 1 × 7
#>      X1 X2       X3 Y1       Y2 Y3       Y4
#>   <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1     1 a         2 b         3 c         4

You can optionally choose which columns to apply the transformation to:

df %>% rename_with(toupper, starts_with("x"))
#> # A tibble: 1 × 7
#>      X1 X2       X3 y1       y2 y3       y4
#>   <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1     1 a         2 b         3 c         4

df %>% rename_with(toupper, where(is.numeric))
#> # A tibble: 1 × 7
#>      X1 x2       X3 y1       Y2 y3       Y4
#>   <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1     1 a         2 b         3 c         4

5.4.3 relocate()

relocate() is a specialized function for moving columns around. By default, it moves the selected columns to the front (the left-hand side).

df <- tibble(w = 0, x = 1, y = "a", z = "b")

df %>% relocate(y, z)
#> # A tibble: 1 × 4
#>   y     z         w     x
#>   <chr> <chr> <dbl> <dbl>
#> 1 a     b         0     1

# Programmatic movement
df %>% relocate(where(is.character))
#> # A tibble: 1 × 4
#>   y     z         w     x
#>   <chr> <chr> <dbl> <dbl>
#> 1 a     b         0     1

If you want to move columns to a different position, use .before or .after:

df %>% relocate(w, .after = y)
#> # A tibble: 1 × 4
#>       x y         w z    
#>   <dbl> <chr> <dbl> <chr>
#> 1     1 a         0 b

df %>% relocate(w, .before = y)
#> # A tibble: 1 × 4
#>       x     w y     z    
#>   <dbl> <dbl> <chr> <chr>
#> 1     1     0 a     b

Use last_col() to move to the right-hand side:

df %>% relocate(w, .after = last_col())
#> # A tibble: 1 × 4
#>       x y     z         w
#>   <dbl> <chr> <chr> <dbl>
#> 1     1 a     b         0

5.5 Working across columns

See the Column-wise operations vignette and notes on the vignette.

across() replaces and supersedes the _if(), _at(), and _all() suffixed variants of summarise() and mutate().
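To make the correspondence concrete, here is a small sketch comparing a superseded scoped verb with its across() equivalent. The tibble `measures` and the result names are invented for the example.

```r
library(dplyr)

# Hypothetical data: one character column and two numeric columns
measures <- tibble(g = c("a", "b"), x = c(1.234, 5.678), y = c(9.876, 5.432))

# Superseded scoped variant
old_style <- measures %>% mutate_if(is.numeric, round)

# Equivalent call with across()
new_style <- measures %>% mutate(across(where(is.numeric), round))

all.equal(old_style, new_style)
```

The superseded functions still work, but across() keeps the column selection and the transformation inside a single mutate() or summarise() call.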

5.5.1 Basic usage

across() has two main arguments:

  1. .cols: selects the columns you want to operate on using tidy select syntax.
  2. .fns: a function or list of functions to apply to each column.

across() is built on tidyselect, so it accepts helpers such as where() and starts_with(), and it uses c() to select multiple columns in place of the older vars() function.

starwars %>% 
  summarise(across(where(is.character), n_distinct))
#> # A tibble: 1 × 8
#>    name hair_color skin_color eye_color   sex gender homeworld species
#>   <int>      <int>      <int>     <int> <int>  <int>     <int>   <int>
#> 1    87         12         31        15     5      3        49      38

starwars %>% 
  summarise(across(c(sex, gender, homeworld), n_distinct))
#> # A tibble: 1 × 3
#>     sex gender homeworld
#>   <int>  <int>     <int>
#> 1     5      3        49

You can also apply a function with extra arguments. As of dplyr 1.1.0, passing those arguments through the ... of across() is deprecated, so wrap the call in an anonymous function or lambda instead.

starwars %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)), n = n())
#> # A tibble: 10 × 5
#>    homeworld height  mass birth_year     n
#>    <chr>      <dbl> <dbl>      <dbl> <int>
#>  1 Alderaan    176.  64         43       3
#>  2 Corellia    175   78.5       25       2
#>  3 Coruscant   174.  50         91       3
#>  4 Kamino      208.  83.1       31.5     3
#>  5 Kashyyyk    231  124        200       2
#>  6 Mirial      168   53.1       49       2
#>  7 Naboo       177.  64.2       55      11
#>  8 Ryloth      179   55         48       2
#>  9 Tatooine    170.  85.4       54.6    10
#> 10 <NA>        139.  82        334.     10
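Since .fns also accepts a list of functions, several summaries can be computed per column in one call. The sketch below is one way to do this; the list names `mean` and `max` and the object `stats` are choices made for the example, and the .names glue specification is spelled out explicitly.

```r
library(dplyr)

stats <- starwars %>%
  summarise(across(
    c(height, mass),
    list(
      mean = \(x) mean(x, na.rm = TRUE),
      max  = \(x) max(x, na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"  # output columns are named <column>_<function>
  ))

names(stats)
```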

5.6 Working within rows

See the Row-wise operations vignette and notes on the vignette.

rowwise() works like group_by() in the sense that it doesn’t change what the data looks like; it changes how dplyr verbs operate on the data.

For example, suppose you want to calculate the mean of each student's test scores:

df <- tibble(
  student_id = 1:4, 
  test1 = 10:13, 
  test2 = 20:23, 
  test3 = 30:33, 
  test4 = 40:43
)

# mutate() alone does not do what we want: mean() pools every score from every row
df %>% mutate(avg = mean(c(test1, test2, test3, test4)))
#> # A tibble: 4 × 6
#>   student_id test1 test2 test3 test4   avg
#>        <int> <int> <int> <int> <int> <dbl>
#> 1          1    10    20    30    40  26.5
#> 2          2    11    21    31    41  26.5
#> 3          3    12    22    32    42  26.5
#> 4          4    13    23    33    43  26.5

# With rowwise(), mean() is computed separately for each row
df %>% 
  rowwise() %>% 
  mutate(avg = mean(c(test1, test2, test3, test4)))
#> # A tibble: 4 × 6
#> # Rowwise: 
#>   student_id test1 test2 test3 test4   avg
#>        <int> <int> <int> <int> <int> <dbl>
#> 1          1    10    20    30    40    25
#> 2          2    11    21    31    41    26
#> 3          3    12    22    32    42    27
#> 4          4    13    23    33    43    28

You can also pair rowwise() with c_across() to select columns with tidyselect helpers. c_across() combines the selected values using vctrs::vec_c().

df %>% 
  rowwise() %>% 
  mutate(avg = mean(c_across(starts_with("test"))))
#> # A tibble: 4 × 6
#> # Rowwise: 
#>   student_id test1 test2 test3 test4   avg
#>        <int> <int> <int> <int> <int> <dbl>
#> 1          1    10    20    30    40    25
#> 2          2    11    21    31    41    26
#> 3          3    12    22    32    42    27
#> 4          4    13    23    33    43    28
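Because rowwise() behaves like a grouping, it stays in effect for later verbs until it is removed. A minimal sketch, repeating the `df` definition from above so that it runs on its own, uses ungroup() to return to an ordinary tibble; the name `averaged` is illustrative.

```r
library(dplyr)

df <- tibble(
  student_id = 1:4,
  test1 = 10:13,
  test2 = 20:23,
  test3 = 30:33,
  test4 = 40:43
)

averaged <- df %>%
  rowwise() %>%
  mutate(avg = mean(c_across(starts_with("test")))) %>%
  ungroup()  # drop the row-wise grouping

# Later verbs now operate on whole columns again
averaged %>% summarise(overall = mean(avg))
```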