6  dplyr 1.1.0

Published

February 1, 2023

Modified

January 8, 2024

dplyr 1.1.0 was released on 29 January 2023.

6.1 dplyr 1.1.0 Blog posts

6.2 Resources

6.3 Joins

6.3.1 Data

Data with two tables of transactions with company id and companies with information about the company.

transactions <- tibble(
  company = c("A", "A", "B", "B"),
  year = c(2019, 2020, 2021, 2023),
  revenue = c(50, 4, 10, 12)
)

companies <- tibble(
  id = c("A", "B"),
  name = c("Patagonia", "RStudio")
)

6.3.2 join_by()

Instead of the syntax: by = c("company" = "id") use function join_by(): by = join_by(company == id)

# Previously
transactions |> 
  inner_join(companies, by = c("company" = "id"))
#> # A tibble: 4 × 4
#>   company  year revenue name     
#>   <chr>   <dbl>   <dbl> <chr>    
#> 1 A        2019      50 Patagonia
#> 2 A        2020       4 Patagonia
#> 3 B        2021      10 RStudio  
#> 4 B        2023      12 RStudio

# With join_by()
transactions |> 
  inner_join(companies, by = join_by(company == id))
#> # A tibble: 4 × 4
#>   company  year revenue name     
#>   <chr>   <dbl>   <dbl> <chr>    
#> 1 A        2019      50 Patagonia
#> 2 A        2020       4 Patagonia
#> 3 B        2021      10 RStudio  
#> 4 B        2023      12 RStudio

6.3.3 Multiple matches

Add column to companies with information about founding date.

companies <- tibble(
  id = c("A", "B", "B"),
  since = c(1973, 2009, 2022),
  name = c("Patagonia", "RStudio", "Posit")
)

Warning when multiple matches appear and extra rows are added.

transactions |> 
  inner_join(companies, by = join_by(company == id))
#> Warning in inner_join(transactions, companies, by = join_by(company == id)): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 3 of `x` matches multiple rows in `y`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#>   "many-to-many"` to silence this warning.
#> # A tibble: 6 × 5
#>   company  year revenue since name     
#>   <chr>   <dbl>   <dbl> <dbl> <chr>    
#> 1 A        2019      50  1973 Patagonia
#> 2 A        2020       4  1973 Patagonia
#> 3 B        2021      10  2009 RStudio  
#> 4 B        2021      10  2022 Posit    
#> 5 B        2023      12  2009 RStudio  
#> 6 B        2023      12  2022 Posit

Suppress warning with multiple = "all" or make it an error with multiple = "error".

transactions |> 
  inner_join(companies, by = join_by(company == id),
             multiple = "error")
#> Warning: Specifying `multiple = "error"` was deprecated in dplyr 1.1.1.
#> ℹ Please use `relationship = "many-to-one"` instead.
#> Error in `inner_join()`:
#> ! Each row in `x` must match at most 1 row in `y`.
#> ℹ Row 3 of `x` matches multiple rows in `y`.

6.3.4 Inequality joins

Use of inequality joins to help to fix the issue of multiple matches. Use of inequality expression in join_by(). In this case, we only want transactions to be linked to companies after they have been founded: year >= since.

transactions |>
  inner_join(companies, join_by(company == id, year >= since))
#> # A tibble: 5 × 5
#>   company  year revenue since name     
#>   <chr>   <dbl>   <dbl> <dbl> <chr>    
#> 1 A        2019      50  1973 Patagonia
#> 2 A        2020       4  1973 Patagonia
#> 3 B        2021      10  2009 RStudio  
#> 4 B        2023      12  2009 RStudio  
#> 5 B        2023      12  2022 Posit

This eliminates the 2021 match to Posit, but inequality joins are still likely to return multiple matches because they are only bounded on one side.

6.3.5 Rolling joins

Rolling joins are meant to help fix the issue of the one-sided nature of inequality joins. This can be done with the closest() helper to filter matches to those that are closest between year and since.

transactions |>
  inner_join(companies, join_by(company == id, closest(year >= since)))
#> # A tibble: 4 × 5
#>   company  year revenue since name     
#>   <chr>   <dbl>   <dbl> <dbl> <chr>    
#> 1 A        2019      50  1973 Patagonia
#> 2 A        2020       4  1973 Patagonia
#> 3 B        2021      10  2009 RStudio  
#> 4 B        2023      12  2022 Posit

6.3.6 Unmatched rows

Can catch unmatched rows with argument unmatched. Can set unmatched = "error" to catch it rows are dropper in inner_join().

# Add row that will not be matched
transactions <- transactions |>
  tibble::add_row(company = "C", year = 2023, revenue = 15)

# Unmatched rows silently dropped
transactions |>
  inner_join(
    companies, 
    join_by(company == id, closest(year >= since))
  )
#> # A tibble: 4 × 5
#>   company  year revenue since name     
#>   <chr>   <dbl>   <dbl> <dbl> <chr>    
#> 1 A        2019      50  1973 Patagonia
#> 2 A        2020       4  1973 Patagonia
#> 3 B        2021      10  2009 RStudio  
#> 4 B        2023      12  2022 Posit

# Error with unmatched rows
transactions |>
  inner_join(
    companies, 
    join_by(company == id, closest(year >= since)),
    unmatched = "error"
  )
#> Error in `inner_join()`:
#> ! Each row of `x` must have a match in `y`.
#> ℹ Row 5 of `x` does not have a match.

6.4 Per-operation grouping

transactions <- tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year = c(2019, 2019, 2020, 2021, 2023, 2023),
  revenue = c(20, 50, 4, 10, 12, 18)
)

6.4.1 Persistent grouping with group_by()

Groups are maintained through other function calls, such as to mutate() or summarise().

transactions |>
  group_by(company, year) |>
  mutate(total = sum(revenue))
#> # A tibble: 6 × 4
#> # Groups:   company, year [4]
#>   company  year revenue total
#>   <chr>   <dbl>   <dbl> <dbl>
#> 1 A        2019      20    70
#> 2 A        2019      50    70
#> 3 A        2020       4     4
#> 4 B        2021      10    10
#> 5 B        2023      12    30
#> 6 B        2023      18    30

6.4.2 Per-operation grouping with .by/by

Documentation

dplyr 1.1.0 adds per-operation grouping within summarise(), mutate(), filter(), and some other dplyr verbs with the new .by argument.

transactions |>
  summarise(total = sum(revenue), .by = c(company, year))
#> # A tibble: 4 × 3
#>   company  year total
#>   <chr>   <dbl> <dbl>
#> 1 A        2019    70
#> 2 A        2020     4
#> 3 B        2021    10
#> 4 B        2023    30

6.4.3 group_by() vs .by

  • The results are always ungrouped, (e.g. ) and tidyselect is used to group multiple columns.
  • You cannot create variables on the fly in .by, you must create them earlier in your pipeline, e.g., unlike df |> group_by(month = floor_date(date, "month")).
  • Difference in how the two ways deal with ordering rows from arrange().
# Ordered tibble
transactions2 <- transactions |>
  arrange(company, desc(year))
transactions2
#> # A tibble: 6 × 3
#>   company  year revenue
#>   <chr>   <dbl>   <dbl>
#> 1 A        2020       4
#> 2 A        2019      20
#> 3 A        2019      50
#> 4 B        2023      12
#> 5 B        2023      18
#> 6 B        2021      10

# `group_by()` re-sorts by grouping keys
transactions2 |>
  group_by(company, year) |>
  summarise(total = sum(revenue), .groups = "drop")
#> # A tibble: 4 × 3
#>   company  year total
#>   <chr>   <dbl> <dbl>
#> 1 A        2019    70
#> 2 A        2020     4
#> 3 B        2021    10
#> 4 B        2023    30

# `.by` keeps the previous ordering
transactions2 |>
  summarise(total = sum(revenue), .by = c(company, year))
#> # A tibble: 4 × 3
#>   company  year total
#>   <chr>   <dbl> <dbl>
#> 1 A        2020     4
#> 2 A        2019    70
#> 3 B        2023    30
#> 4 B        2021    10

6.5 The power of vctrs

6.5.1 case_when()

Ability to use regular NA in case_when() instead of specific class of NA. The following now just works.

x <- c(1, 12, -5, 6, -2, NA, 0)

case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  x < 0 ~ NA
)
#> [1] "small" "large" NA      "small" NA      NA      "small"

New .default argument for when none of the cases are met. The .default is always processed last, so the recommendation is to place it at the end of the list of cases. Now the negative values are labeled “other”.

case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  is.na(x) ~ "missing",
  .default = "other"
)
#> [1] "small"   "large"   "other"   "small"   "other"   "missing" "small"

6.5.2 case_match()

Remapping values with case_when() has been possible, but is a bit verbose.

x <- c("USA", "Canada", "Wales", "UK", "China", NA, "Mexico", "Russia")

case_when(
  x %in% c("USA", "Canada", "Mexico") ~ "North America",
  x %in% c("Wales", "UK") ~ "Europe",
  x %in% "China" ~ "Asia"
)
#> [1] "North America" "North America" "Europe"        "Europe"       
#> [5] "Asia"          NA              "North America" NA

case_match() is a new function that removes the repetition involved with x %in%.

#|label: case_match
case_match(
  x,
  c("USA", "Canada", "Mexico") ~ "North America",
  c("France", "UK") ~ "Europe",
  "China" ~ "Asia"
)
#> [1] "North America" "North America" NA              "Europe"       
#> [5] "Asia"          NA              "North America" NA

case_match() is particularly helpful in a replacement helper function where you might want to change only a couple of values but leave everything else as is.

replace_match <- function(x, ...) {
  case_match(x, ..., .default = x, .ptype = x)
}

replace_match(
  x, 
  "USA" ~ "United States", 
  c("UK", "Wales") ~ "United Kingdom",
  NA ~ "[Missing]"
)
#> [1] "United States"  "Canada"         "United Kingdom" "United Kingdom"
#> [5] "China"          "[Missing]"      "Mexico"         "Russia"

6.5.3 consecutive_id()

The ability to create a consecutive id column linked to another id column so that only consecutive columns are collapsed with group_by() and summarise(). Create this with mutate(id = consecutive_id(name)).

transcript <- tribble(
  ~name, ~text,
  "Hadley", "I'll never learn Python.",
  "Davis", "But aren't you speaking at PyCon?",
  "Hadley", "So?",
  "Hadley", "That doesn't influence my decision.",
  "Hadley", "I'm not budging!",
  "Mara", "Typical, Hadley. Stubborn as always.",
  "Davis", "Fair enough!",
  "Davis", "Let's move on."
)

# Create consecutive id column
transcript |>
  mutate(id = consecutive_id(name))
#> # A tibble: 8 × 3
#>   name   text                                    id
#>   <chr>  <chr>                                <int>
#> 1 Hadley I'll never learn Python.                 1
#> 2 Davis  But aren't you speaking at PyCon?        2
#> 3 Hadley So?                                      3
#> 4 Hadley That doesn't influence my decision.      3
#> 5 Hadley I'm not budging!                         3
#> 6 Mara   Typical, Hadley. Stubborn as always.     4
#> 7 Davis  Fair enough!                             5
#> 8 Davis  Let's move on.                           5

Now you can do the grouping and summarize. You could do it just by id, but it is useful to group_by() name and id to keep the name in the summary table.

transcript |>
  mutate(id = consecutive_id(name)) |>
  summarise(text = stringr::str_flatten(text, collapse = " "), .by = c(id, name))
#> # A tibble: 5 × 3
#>      id name   text                                                    
#>   <int> <chr>  <chr>                                                   
#> 1     1 Hadley I'll never learn Python.                                
#> 2     2 Davis  But aren't you speaking at PyCon?                       
#> 3     3 Hadley So? That doesn't influence my decision. I'm not budging!
#> 4     4 Mara   Typical, Hadley. Stubborn as always.                    
#> 5     5 Davis  Fair enough! Let's move on.

6.6 pick(), reframe(), and arrange()

6.6.1 pick()

pick() is a new function that is meant to be a compliment to across(). With across(), you typically apply a function to each column. With pick(), you typically apply a function to the full data frame of columns that you pick. In this way, pick() is replaceable with an equivalent call to tibble(). pick(a, c) creates a data frame that is the same as tibble(a = a, c = c).

df <- tibble(
  x_1 = c(1, 3, 2, 1, 2), 
  x_2 = 6:10, 
  w_4 = 11:15, 
  y_2 = c(5, 2, 4, 0, 6)
)

# For instance finding the number of columns selected
# makes more semantic sense for pick() than across()
df |>
  summarise(
    n_x = ncol(pick(starts_with("x"))),
    n_y = ncol(pick(starts_with("y")))
  )
#> # A tibble: 1 × 2
#>     n_x   n_y
#>   <int> <int>
#> 1     2     1

pick() is particularly useful in combination with ranking functions like dense_rank() that can take a data frame and rank the values.

df |>
  mutate(
    rank1 = dense_rank(x_1), 
    rank2 = dense_rank(pick(x_1, y_2)) # Using `y_2` to break ties in `x_1`
  )
#> # A tibble: 5 × 6
#>     x_1   x_2   w_4   y_2 rank1 rank2
#>   <dbl> <int> <int> <dbl> <int> <int>
#> 1     1     6    11     5     1     2
#> 2     3     7    12     2     3     5
#> 3     2     8    13     4     2     3
#> 4     1     9    14     0     1     1
#> 5     2    10    15     6     2     4

You can also use pick() as a bridge between tidy selection and data masking in functions. For instance, it is useful in creating a function with group_by(). See Data masking patters: Bridge patterns and the notes on the vignette.

my_group_by <- function(data, cols) {
  group_by(data, pick({{ cols }}))
}

my_group_by(df, starts_with("x"))
#> # A tibble: 5 × 4
#> # Groups:   x_1, x_2 [5]
#>     x_1   x_2   w_4   y_2
#>   <dbl> <int> <int> <dbl>
#> 1     1     6    11     5
#> 2     3     7    12     2
#> 3     2     8    13     4
#> 4     1     9    14     0
#> 5     2    10    15     6

6.6.2 reframe()

reframe() is intended as a replacement for the ability to use summarise() to return multiple results per group that was introduced in dplyr 1.0.0. Now using summarise() to do this causes a warning. reframe() now takes on the role of “doing something” to each group with no restrictions on the number of rows returned per group.

One nice application of reframe() is computing quantiles at various probability thresholds.

df <- tibble(
  g = c(1, 1, 1, 2, 2),
  x = c(4, 3, 6, 2, 8),
  y = c(5, 1, 2, 8, 9)
)

# Helper function
quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {
  tibble(
    value = quantile(x, probs, na.rm = TRUE),
    prob = probs
  )
}

df |>
  reframe(quantile_df(x), .by = g)
#> # A tibble: 6 × 3
#>       g value  prob
#>   <dbl> <dbl> <dbl>
#> 1     1   3.5  0.25
#> 2     1   4    0.5 
#> 3     1   5    0.75
#> 4     2   3.5  0.25
#> 5     2   5    0.5 
#> 6     2   6.5  0.75

You can apply such a function to multiple columns using across(), which returns a packed data frame. You can unnest this using the new .unpack argument from across().

df %>%
  reframe(across(x:y, quantile_df, .unpack = TRUE), .by = g)
#> # A tibble: 6 × 5
#>       g x_value x_prob y_value y_prob
#>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
#> 1     1     3.5   0.25    1.5    0.25
#> 2     1     4     0.5     2      0.5 
#> 3     1     5     0.75    3.5    0.75
#> 4     2     3.5   0.25    8.25   0.25
#> 5     2     5     0.5     8.5    0.5 
#> 6     2     6.5   0.75    8.75   0.75